Tag
This paper introduces Mask-Proof, an LLM-based pipeline that converts mathematical proofs into masked-step tasks for automated evaluation, and presents MaskProofBench, a benchmark of 292 curated problems that achieves 96.8% agreement with expert annotators.
This paper evaluates the robustness of proof autoformalization models in Lean 4 under global and local perturbations, finding that current LLM-based models are sensitive to perturbations and often fail to faithfully reflect local changes.
This paper introduces a strict step-level verification framework for evaluating research-level mathematical proofs with LLMs; the framework addresses context poisoning and outperforms global evaluation. The approach shifts the focus to deductive constraints and shows that the remaining errors often stem from pedantic hyper-rigor, which exposes implicit ambiguities in the benchmarks.
OpenAI submitted proof attempts for the First Proof challenge, a research-level math competition testing whether AI can produce correct, checkable proofs. The company's internal model successfully solved at least five of the ten problems, demonstrating significant progress in sustained reasoning and rigorous mathematical thinking.