MathCode / AUTOLEAN

A terminal-based AI coding assistant with a built-in math formalization engine that converts natural language math problems into Lean 4 formal theorem statements and proofs.

Read Paper →

Method

MathCode is a terminal-based AI coding assistant with a built-in math formalization engine called AUTOLEAN. Given a math problem in plain natural language (supplied as a JSON file with uuid and problem fields), it automatically converts the problem into a Lean 4 formal theorem statement and optionally produces a formal proof. The system is a fixed pipeline (not a dynamic multi-agent orchestrator) in which each stage acts as a distinct agent.
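An input problem file might look like the following (field names come from the description above; the uuid and problem text are illustrative):

```json
{
  "uuid": "9f1c2ab4-7e3d-4c2a-9b10-0c5e6d7f8a91",
  "problem": "Prove that the sum of two even integers is even."
}
```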

The core formalization pipeline proceeds as follows:

  • Thinking/Planning Agent (Phase 5.2): Runs only on the first iteration. Analyzes the natural-language problem, derives a proof strategy, identifies the exact Mathlib lemma names likely to be needed, and flags coercion/typeclass pitfalls. Uses a thinking-optimized model with optional web search. Outputs structured planning notes without writing any Lean code.

  • Formalizer/Coder Agent (Phase 5.3): Takes the planning notes and the original JSON problem to generate a complete Lean 4 file containing the formalized theorem statement (and optionally a proof). On repair iterations, it additionally receives the previous Lean file, compiler error output, and an error memory of recurring failures.

  • Anti-Trivialization Guard: A deterministic code-level check (not an LLM call) that parses the theorem header and rejects any statement where the main proposition is trivially True or False. This prevents the LLM from replacing the real problem with a vacuous statement that compiles easily.

  • Lean 4 Compiler: Executes lake env lean against the generated .lean file within a Mathlib-configured workspace. If compilation fails, errors are fed back to the Formalizer in a compile-repair loop (up to 6 iterations by default).

  • Semantic Evaluator: After successful compilation, a separate LLM grades the formalization’s fidelity against the original NLP problem on a 5-point checklist (core objects/domains, quantifier structure, hypotheses, conclusions, multi-part coverage). Formalizations scoring below the threshold (default B) are rejected and retried with evaluator feedback.
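The formalize → guard → compile → evaluate loop above can be sketched roughly as follows. All function names and signatures here are hypothetical stand-ins, not the actual MathCode API: generate is the LLM formalizer, compile_lean wraps the lake env lean invocation, and evaluate is the semantic grader.

```python
def compile_repair_loop(plan, problem, generate, check_guard, compile_lean,
                        evaluate, max_iters=6, threshold="B"):
    """Sketch of the formalize -> guard -> compile -> evaluate loop.

    `generate`, `compile_lean`, and `evaluate` are placeholders for the
    LLM formalizer, the `lake env lean` invocation, and the semantic
    evaluator described in the text; all names are illustrative.
    """
    feedback = None      # compiler errors or evaluator feedback
    previous_src = None  # previous Lean file, supplied on repair iterations
    for _ in range(max_iters):
        src = generate(plan, problem, previous=previous_src, feedback=feedback)
        if not check_guard(src):  # deterministic anti-trivialization guard
            previous_src, feedback = src, "statement is trivially True/False"
            continue
        ok, errors = compile_lean(src)  # run lake env lean on the file
        if not ok:
            previous_src, feedback = src, errors
            continue
        grade = evaluate(src, problem)  # 5-point fidelity checklist, A best
        if grade <= threshold:          # "A" <= "B" lexicographically
            return src, grade
        previous_src, feedback = src, f"fidelity grade {grade} below {threshold}"
    return None, None
```

The key design point is that each failure mode (guard rejection, compile error, low fidelity grade) feeds a different kind of feedback into the next formalizer call, rather than retrying blindly.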

For proof completion, two separate scripts extend the pipeline:

  • Proof Planner: Generates proof strategies for formalized statements. Replans after a configurable number of failed attempts, incorporating the latest failure report.

  • Proof Prover: Replaces sorry placeholders with actual proofs. Multiple independent attempts run in parallel, each with its own compile-repair loop. Rejects outputs that still contain sorry/admit or introduce unauthorized declarations.
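The prover's output check (reject any remaining sorry/admit, reject unauthorized declarations) could look like the sketch below. The regexes and helper names are assumptions for illustration; the real parser presumably handles more of Lean's surface syntax.

```python
import re

# Top-level Lean declarations (sketch; does not cover anonymous instances,
# namespaces, or attributes).
DECL_RE = re.compile(
    r"^\s*(?:theorem|lemma|def|abbrev|axiom|instance)\s+([A-Za-z0-9_'.]+)",
    re.MULTILINE,
)

def validate_proof(lean_src: str, allowed_decls: set[str]) -> tuple[bool, str]:
    """Reject proofs that still contain placeholders or smuggle in new
    top-level declarations (e.g. an `axiom` that assumes the goal)."""
    if re.search(r"\b(sorry|admit)\b", lean_src):
        return False, "placeholder (sorry/admit) remains"
    declared = set(DECL_RE.findall(lean_src))
    extra = declared - allowed_decls
    if extra:
        return False, f"unauthorized declarations: {sorted(extra)}"
    return True, "ok"
```

The unauthorized-declaration check matters because an LLM can otherwise "prove" anything by first declaring an axiom and then citing it.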

The system supports multiple LLM backends (OpenRouter API, Codex Exec, Claude CLI), optional web search for mathematical references, and parallel workers across problems.

Statement Fidelity Checking

Since MathCode produces the formal statement itself, it has two mechanisms to prevent trivialized or incorrect formalizations:

  1. Anti-Trivialization Guard (Hardcoded Policy): Detects and rejects statements where the top-level proposition is literally True or False. It catches only the most blatant trivializations.

  2. Semantic Evaluator (LLM-based Grading): Applies a detailed 5-point checklist to grade fidelity from A to D: D for trivialized/vacuous statements, C for missing major obligations, B for minor issues, and A for fully faithful formalizations. An optional strict A+ recheck distinguishes exact matches from broadly faithful ones. A double-check mode can require cross-model agreement (e.g., both Gemini Flash and GPT-5.2 must assign grade A).
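A minimal version of the hardcoded guard (mechanism 1) might extract the goal between the final colon of the theorem header and the := token. This is an illustrative sketch under simplifying assumptions, not the actual implementation:

```python
import re

def passes_trivialization_guard(lean_src: str) -> bool:
    """Reject theorem statements whose top-level proposition is literally
    `True` or `False`.

    Sketch only: the naive header split below breaks on typed binders
    like `(h : Even a)`, which contain their own colons; real header
    parsing has to account for that.
    """
    m = re.search(r"^\s*theorem\s+\w+[^:]*:\s*(.+?)\s*:=", lean_src,
                  re.MULTILINE | re.DOTALL)
    if m is None:
        return True  # no recognizable theorem header; let the compiler decide
    goal = m.group(1).strip()
    return goal not in ("True", "False")
```

Because this check is deterministic code rather than an LLM call, it cannot be talked around by the formalizer, which is why the semantic evaluator (mechanism 2) only needs to handle the subtler fidelity failures.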