Autoformalization: Comparative Overview

Side-by-side comparison of autoformalization papers across agents, tools, datasets, and evaluation methods.

Input & Output

Paper Input Output
APOLLO Formal Statement Formal Proof
Minimal_Agent Formal Statement Formal Proof
Agentic Proof Automation Formal Statement Formal Proof
Aristotle Formal Statement Formal Proof
ATF NLP Statement Formal Statement
Textbook Formalizer NLP Document Formal Statement + Formal Proof
Ax-Prover Formal Statement Formal Proof
CriticLean NLP Statement Formal Statement
Designing Agentic Tool (proposal) NLP Statement Formal Statement + Formal Proof
HILBERT Formal Statement Formal Proof
Lifelong Learner Formal Statement Formal Proof
APRIL Formal Proof Formal Proof
M2F NLP Document Formal Statement + Formal Proof
MathCode NLP Statement Formal Statement + Formal Proof
MerLean NLP Document Formal Statement + Formal Proof
Numina-Lean-Agent Formal Statement Formal Proof
Process-Driven Autoformalization NLP Statement + NLP Proof Formal Statement + Formal Proof
ProofBridge NLP Statement + NLP Proof Formal Statement + Formal Proof
Seed-Prover Formal Statement Formal Proof
Delta Prover Formal Statement + NLP Statement Formal Proof
TheoremLlama Formal Statement + NLP Statement + NLP Proof Formal Proof

Agents Used

Paper Formal Prover Nlp Prover Formalize Statement NLP=FL Semantic Check Break to Lemma Orchestrator Formalizer (proof+statement) Other
APOLLO · · · · · · Proof Assembler
Minimal_Agent · · · · · · Memory Manager
Agentic Proof Automation · · · · · ·
Aristotle · · · · Geometry Solver Proof Search Algorithm
ATF · · · · ·
Textbook Formalizer · · · · Engineering Reviewer Maintainer Progress Scanner Triage
Ax-Prover · · · · ·
CriticLean · · · · ·
Designing Agentic Tool (proposal) · · · · Nlp Proof Reviewer Code Generation
HILBERT · · · · Proof Assembler Subgoal Extractor
APRIL · · · · · ·
M2F · · · Statement Extractor
MathCode · · · · Planner
MerLean · · · · · Statement Extractor Autoinformalizer
Numina-Lean-Agent · · · Nlp Proof Reviewer
Process-Driven Autoformalization · · · · · · Rank Formalizations
ProofBridge · · · · ·
Seed-Prover · · · ·
Delta Prover · · · ·
TheoremLlama · · · · · · Autoinformalizer

Tools & MCPs Available

Paper Lean 4 MCP Lean Search File System Code Execution Other
APOLLO · · · Auto Solver Sorrifier Syntax Refiner
Minimal_Agent · ·
Agentic Proof Automation · ·
Aristotle · · · Yuclid
ATF · · ·
Textbook Formalizer · Bash Tool Git Tools Issue Tracker
Ax-Prover ·
CriticLean · · ·
Designing Agentic Tool (proposal) · · Web Search
HILBERT · ·
APRIL · · ·
M2F · · ·
MathCode · Anti Trivialization Guard
MerLean ·
Numina-Lean-Agent · ·
Process-Driven Autoformalization · · ·
ProofBridge · ·
Seed-Prover ·
Delta Prover · · Custom Lean4 Dsl
TheoremLlama · ·

Datasets Used

Paper MiniF2F PutnamBench FATE Textbook Research Papers Combibench Imo Other
APOLLO · · · · · Proofnet
Minimal_Agent · · · · LeanCat
Aristotle · · · · · Ag 30
ATF · · · · · · Formalmath Lite Proverbench
Textbook Formalizer · · · · · ·
CriticLean · · · · · · · Criticleanbench Omni Math
HILBERT · · · · ·
APRIL · · · · · · · APRIL
M2F · · · ·
MerLean · · · · · ·
Process-Driven Autoformalization · · · · · · · Mathlib Theorems MATH GSM8K
ProofBridge · · · · · ·
Seed-Prover · · · Erdős
Delta Prover · · · · · ·
TheoremLlama · · · · · ·

Statement Formalization Evaluation

Paper Evaluation Method
ATF For the human evaluation, three domain experts independently review 100 randomly sampled instances from each benchmark, with the majority opinion serving as the final performance judgment.
Textbook Formalizer Post-run, an interactive blueprint website enables human community verification side-by-side. No quantitative metric for statement accuracy is reported.
CriticLean Two-step gated verification: Lean 4 compiler checks syntactic validity, then CriticLeanGPT performs deep semantic check. Human evaluation on 50 Omni-MATH problems showed the critic feedback loop boosted accuracy from 38.0% (single pass) to 84.0%.
M2F Human domain experts compare the generated Lean declarations directly against the original, source-linked text spans.
MathCode Two-layer check: (1) hardcoded anti-trivialization guard rejects statements reduced to True/False, (2) LLM-based semantic evaluator grades fidelity A-D against original NLP problem on 5-point checklist (objects, quantifiers, hypotheses, conclusions, multi-part coverage). Optional strict A+ recheck and double-check mode with cross-model agreement.
MerLean Manual review to ensure all new definitions and axioms are mathematically accurate and rigorously constructed, using Autoinformalizer.
Process-Driven Autoformalization Human Lean 4 experts manually evaluated the outputs.
ProofBridge Prove equivilance in Lean4 restricted tactic set with Gemini-2.5-Pro.