Verifiable Geometry Problem Solving: Solver-Driven Autoformalization and Theorem Proposing

arXiv cs.AI 06/29/26, 04:00 AM Papers
geometry problem-solving autoformalization theorem-proposing neuro-symbolic machine-learning ai-research
Summary
This paper introduces SD-GPS, a solver-driven framework for geometry problem solving that uses autoformalization guided by solver feedback and verified theorem proposing to overcome bottlenecks in neuro-symbolic systems.
arXiv:2606.27926v1 Announce Type: new Abstract: Geometry Problem Solving have increasingly adopt the neuro-symbolic paradigm, combining neural intuition with symbolic rigor. However, current frameworks suffer from severe bottlenecks in two core stages: autoformalization, which treats multimodal translation as a static task decoupled from downstream solver compatibility, and theorem prediction, where solvers frequently hit a deductive impasse due to fixed rule libraries. To address these, we propose SD-GPS, a solver-driven framework that treats the symbolic solver as an execution oracle throughout both formalization and deduction. First, Solver-Driven Autoformalization unifies supervised formal-language adaptation and solvability-guided reinforcement learning into a single module built on QwenVL3-2B, making executability the central training signal. Second, Verified Theorem Proposing introduces an impasse-aware agent that proposes local auxiliary lemmas from current proof states, ensuring soundness by filtering all proposals through symbolic verification. Empirical evaluations on Geometry3K and PGPS9K demonstrate that SD-GPS consistently outperforms existing MLLM, neural, and neuro-symbolic methods across standard completion, multiple-choice, and cross-modal reference regimes, proving that closing the loop between multimodal perception and symbolic execution significantly improves geometric reasoning, offering profound insights into how neural agents can be grounded by formal systems to achieve verifiable problem-solving capabilities.
Original Article
View Cached Full Text
Cached at: 06/29/26, 05:27 AM
# Verifiable Geometry Problem Solving: Solver-Driven Autoformalization and Theorem Proposing
Source: [https://arxiv.org/html/2606.27926](https://arxiv.org/html/2606.27926)
###### Abstract

Geometry Problem Solving have increasingly adopt the neuro\-symbolic paradigm, combining neural intuition with symbolic rigor\. However, current frameworks suffer from severe bottlenecks in two core stages: autoformalization, which treats multimodal translation as a static task decoupled from downstream solver compatibility, and theorem prediction, where solvers frequently hit a deductive impasse due to fixed rule libraries\. To address these, we propose SD\-GPS, a solver\-driven framework that treats the symbolic solver as an execution oracle throughout both formalization and deduction\. First, Solver\-Driven Autoformalization unifies supervised formal\-language adaptation and solvability\-guided reinforcement learning into a single module built on QwenVL3\-2B, making executability the central training signal\. Second, Verified Theorem Proposing introduces an impasse\-aware agent that proposes local auxiliary lemmas from current proof states, ensuring soundness by filtering all proposals through symbolic verification\. Empirical evaluations on Geometry3K and PGPS9K demonstrate that SD\-GPS consistently outperforms existing MLLM, neural, and neuro\-symbolic methods across standard completion, multiple\-choice, and cross\-modal reference regimes, proving that closing the loop between multimodal perception and symbolic execution significantly improves geometric reasoning, offering profound insights into how neural agents can be grounded by formal systems to achieve verifiable problem\-solving capabilities\.

## 1Introduction

Geometry Problem Solving \(GPS\) aims to derive mathematical solutions from a textual problem description and its corresponding diagram\(Chenet al\.,[2022](https://arxiv.org/html/2606.27926#bib.bib35); Gelernteret al\.,[1960](https://arxiv.org/html/2606.27926#bib.bib41); Sachan and Xing,[2017](https://arxiv.org/html/2606.27926#bib.bib42); Wu,[1986](https://arxiv.org/html/2606.27926#bib.bib43); Wuet al\.,[2024](https://arxiv.org/html/2606.27926#bib.bib2); Penget al\.,[2023](https://arxiv.org/html/2606.27926#bib.bib5)\)\. To tackle the inherent complexity of geometry, many recent systems adopt the neuro\-symbolic paradigm\(Trinhet al\.,[2024](https://arxiv.org/html/2606.27926#bib.bib22); Chervonyiet al\.,[2025](https://arxiv.org/html/2606.27926#bib.bib39)\)\. This dual\-process approach utilizes neural networks as a "perceptual front\-end" and symbolic engines as a "logical back\-end"\(Wuet al\.,[2022a](https://arxiv.org/html/2606.27926#bib.bib28)\)\. Within this framework, the reasoning pipeline is driven by two core modules: \(i\) Autoformalization, which translates multimodal raw inputs into executable formal representations, and \(ii\) Theorem Prediction, which selects or proposes theorem instances that can be used by the solver to derive solutions\. This modular architecture allows systems to leverage the intuitive recognition of deep learning alongside the deterministic rigor of formal logic\.

In the autoformalization stage, the objective is to transform multimodal inputs into solver\-ready predicates\. However, existing GPS frameworks typically adopt a “decoupled\-and\-rectified” paradigm\(Luet al\.,[2021](https://arxiv.org/html/2606.27926#bib.bib1); Zhaoet al\.,[2025](https://arxiv.org/html/2606.27926#bib.bib9); Pinget al\.,[2026](https://arxiv.org/html/2606.27926#bib.bib10)\), in which isolated neural modules independently parse diagrams and formalize text, followed by a post\-hoc rectification step to resolve referential ambiguities\(Zhaoet al\.,[2025](https://arxiv.org/html/2606.27926#bib.bib9); Pinget al\.,[2026](https://arxiv.org/html/2606.27926#bib.bib10)\)\. Beyond this structural fragility, a more fundamental limitation lies in objective misalignment\. Current approaches treat formalization as a static, one\-way semantic translation task, optimizing against expensive human\-annotated logical forms that emphasize linguistic fidelity rather than computational solvability\. As a result, the formalization process is not aligned with the ultimate goal of solver\-based problem solving, and high semantic accuracy does not necessarily translate into effective or executable representations for downstream reasoning\.

Even when provided with accurate formalization, the theorem prediction stage faces a second critical bottleneck: a fixed theorem library and finite search budget can leave the solver at a deductive impasse\. State\-of\-the\-art GPS frameworks\(Pinget al\.,[2026](https://arxiv.org/html/2606.27926#bib.bib10); Zhanget al\.,[2024b](https://arxiv.org/html/2606.27926#bib.bib49)\)usually operate with a pre\-defined library of geometric rules, so they may fail when a problem requires a rarely selected theorem instance, a small auxiliary construction, or an intermediate relation such as cyclicity, similarity, or proportionality\. We therefore treat theorem proposing as a bounded, verifier\-controlled proposal problem rather than as free\-form axiom generation: a neural agent may suggest local auxiliary lemmas from the current proof state, but only the symbolic back\-end can accept, reject, and use them\.

![Refer to caption](https://arxiv.org/html/2606.27926v1/x1.png)Figure 1:Performance comparison among existing methods\.To resolve these misalignments, we proposeSD\-GPS, a solver\-driven framework that treats the symbolic solver as an execution oracle throughout both formalization and deduction\. First, Solver\-Driven Autoformalization replaces the decoupled pipeline with a unified multimodal formalizer built on QwenVL3\-2B\. It reads the raw diagram and problem text jointly, learns the target formal language through supervised adaptation, and is further optimized with solvability\-guided feedback from the solver\. This makes executability, rather than surface\-level textual similarity alone, the central training signal\. Second, Solver\-Driven Theorem Proposing addresses solver\-side impasses\. Instead of allowing a neural model to add unchecked axioms, the agent proposes bounded auxiliary lemmas or theorem instantiations from the current proof state, and every accepted proposal must pass symbolic verification before it can drive the derivation\.

Empirical evaluations on Geometry3KLuet al\.\([2021](https://arxiv.org/html/2606.27926#bib.bib1)\)and PGPS9KZhanget al\.\([2023](https://arxiv.org/html/2606.27926#bib.bib3)\)show consistent improvements over MLLM, neural, and neuro\-symbolic baselines\. As summarized in Figure[1](https://arxiv.org/html/2606.27926#S1.F1),SD\-GPSimproves completion accuracy on both benchmarks over the strongest listed prior systems\. The results suggest that closing the loop between multimodal perception and symbolic execution improves both solver\-ready formalization and final geometric reasoning\.

In summary, our contributions are three\-fold:

- •Solver\-Driven Autoformalization:We unify supervised formal\-language adaptation and solvability\-guided reinforcement learning into a single autoformalization module that produces solver\-executable predicates from raw diagram\-text inputs\.
- •Verified Theorem Proposing:We introduce an impasse\-aware theorem\-proposing agent that proposes auxiliary lemmas from solver states, while preserving soundness by accepting only proposals verified by the symbolic back\-end\.
- •Systematic Evaluation:We evaluate under standard completion and choice settings, cross\-modal point\-reference regimes, and OCR\-induced failure modes, showing robust gains on Geometry3K and PGPS9K\.

## 2Related Work

Geometry problem solving\.The field of GPS has evolved from classical rule\-based engines to contemporary neural\-symbolic hybrids\. Early symbolic methods, such as Wu’s Method\(Wu,[1978](https://arxiv.org/html/2606.27926#bib.bib50)\)and Groebner Basis\(Buchberger,[1976](https://arxiv.org/html/2606.27926#bib.bib36)\), established a foundation of mathematical rigor but were fundamentally limited by their inability to interpret informal multimodal inputs\. To bridge this gap, modern frameworks\(Wuet al\.,[2024](https://arxiv.org/html/2606.27926#bib.bib2); Penget al\.,[2023](https://arxiv.org/html/2606.27926#bib.bib5)\)like Inter\-GPS\(Luet al\.,[2021](https://arxiv.org/html/2606.27926#bib.bib1)\)and UniGeo\(Chenget al\.,[2025](https://arxiv.org/html/2606.27926#bib.bib37)\)integrate deep learning parsers with symbolic solvers\. However, these systems predominantly rely on a decoupled paradigm where text and diagrams are processed by isolated neural modules\. Recent advancements, notably Pi\-GPS\(Zhaoet al\.,[2025](https://arxiv.org/html/2606.27926#bib.bib9)\)and AutoGPS\(Pinget al\.,[2026](https://arxiv.org/html/2606.27926#bib.bib10)\), attempt to mitigate this isolation by introducing specialized post\-hoc rectification stages or diagrammatic heuristics to resolve referential ambiguities\. Despite these efforts, such systems remain tethered to the quality of initial independent parsing results\. This structural separation often leads to cascading information loss, as the formalization process lacks a unified representation to resolve inter\-modal dependencies, particularly when faced with complex spatial relationships that are under\-specified in text or diagram\.

Multimodal autoformalization\.Autoformalization bridges the gap between informal human intuition and machine\-verifiable logic\(Wuet al\.,[2022b](https://arxiv.org/html/2606.27926#bib.bib45)\)\. In geometry, this task is particularly formidable as linguistic descriptions are often logically under\-specified without visual grounding\. The significance of this field is underscored by MATP\-BENCH\(Heet al\.,[2025](https://arxiv.org/html/2606.27926#bib.bib21)\), which highlights that formally verified multimodal mathematics remains a grand challenge for frontier MLLMs\. Recent frameworks, such as AutoGPS\(Pinget al\.,[2026](https://arxiv.org/html/2606.27926#bib.bib10)\)and Pi\-GPS\(Zhaoet al\.,[2025](https://arxiv.org/html/2606.27926#bib.bib9)\), attempt to address this by translating diagrams into predicates\. However, they predominantly rely on a decoupled paradigm that utilizes specialized, non\-transformer parsers like PGDP\(Haoet al\.,[2022](https://arxiv.org/html/2606.27926#bib.bib4)\), necessitating post\-hoc heuristics to resolve inter\-modal inconsistencies\. Similarly, DFE\-GPS\(Zhanget al\.,[2024c](https://arxiv.org/html/2606.27926#bib.bib44)\)treats diagram formalization merely as an auxiliary signal rather than a primary objective\. While recent MMFORMALIZER\(Xionget al\.,[2026](https://arxiv.org/html/2606.27926#bib.bib26)\)extends autoformalization to physics via recursive grounding and rule\-based searches in PhysLean, it focuses on high\-level axiomatic deduction\. In contrast, our framework achieves joint diagram\-text alignment within a unified LMM representation\. By enabling bidirectional disambiguation in a single inference pass, we resolve low\-level topological ambiguities directly from raw pixels, offering a more integrated and scalable pathway for verifiable multimodal reasoning\.

Verifiable reinforcement learning\.Reinforcement learning has also been explored in geometry reasoning, for example GeoDRL formulates deductive reasoning as sequential decision making over geometry logic graphs\(Penget al\.,[2023](https://arxiv.org/html/2606.27926#bib.bib5)\)\. More recently, LLM reasoning has shifted toward*verifiable*rewards\(DeepSeek\-AI,[2025](https://arxiv.org/html/2606.27926#bib.bib51)\), where correctness can be computed automatically from math or code outputs instead of relying on human preference labels\. DeepSeekMath popularized this paradigm in mathematical reasoning with GRPO\(Shaoet al\.,[2024](https://arxiv.org/html/2606.27926#bib.bib23)\), while GSPO later improved training stability and efficiency through sequence\-level policy optimization\(Zhenget al\.,[2025](https://arxiv.org/html/2606.27926#bib.bib24)\)\. In contrast, our work anchors optimization on solver\-readiness of the formal representation\. We adapt GSPO to multimodal autoformalization with rewards derived from parse validity, executable solver states, and answer\-level solvability\. When gold formal annotations are available, we use them for supervised adaptation and analysis; the reinforcement\-learning stage itself can be driven by parser and solver execution feedback, making the objective closer to the downstream symbolic task\.

## 3Method

### 3\.1Framework Overview

Given a geometry problem consisting of a diagramII, a textual questionTT, and a target answera∗a^\{\\ast\},SD\-GPSaims to generate a formal representationΛ^\\hat\{\\Lambda\}that can be consumed by a symbolic solver𝒮\\mathcal\{S\}\. The system is organized around two solver\-coupled loops\. The first loop learns a multimodal formalizerFθF\_\{\\theta\}:

Λ^=Fθ\(I,T\),\\hat\{\\Lambda\}=F\_\{\\theta\}\(I,T\),\(1\)whereΛ^\\hat\{\\Lambda\}contains formal predicates grounded in both the diagram and the text\. The second loop augments symbolic deduction with a theorem\-proposing agent\. When𝒮\(Λ^\)\\mathcal\{S\}\(\\hat\{\\Lambda\}\)cannot derive the target answer, the agent observes the intermediate proof state and proposes a theorem hypothesisτ^\\hat\{\\tau\}, after which the solver verifies whetherτ^\\hat\{\\tau\}is applicable and useful for continuing the proof\. This design keeps all accepted formalizations and theorem proposals grounded in symbolic execution\.

Figure[2](https://arxiv.org/html/2606.27926#S3.F2)illustrates the overall workflow ofSD\-GPS, including multimodal formalization, solver feedback, repair, and theorem proposal\.

![Refer to caption](https://arxiv.org/html/2606.27926v1/x2.png)Figure 2:Overview of the proposedSD\-GPSframework\. Raw diagram\-text inputs are converted into solver\-executable formal predicates, checked by a symbolic solver, refined through solver\-guided repair when necessary, and augmented with verified auxiliary theorem proposals when the solver reaches a deductive impasse\.
### 3\.2Formalization Data: Curated and Synthetic Sources

To expose the formalizer to realistic textbook phrasing, diverse diagram styles, and annotation noise, we build a mixed formalization corpus from curated real\-world examples and controllable synthetic instances\. The real\-world portion is constructed from Geometry3K\-Train, the training split of PGPS9K, and the PGDP training set\. Since these datasets differ in annotation conventions and were not originally designed as a unified strict formal\-language benchmark, we normalize symbolic expressions, entity names, and diagram\-text grounding formats\. Geometry3K annotations are proofread and standardized to reduce inconsistent predicate usage and ambiguous entity references\. For PGPS9K, diagram\-side formal language is converted into the Inter\-GPS formalism, disambiguated during preprocessing, and then filtered by parser and solver execution\. Samples whose converted predicates fail parsing, entity grounding, or solver verification are removed\.

The synthetic portion is generated from geometry templates that specify a diagram construction program, a predicate\-level formalization, and a target query\. Template parameters control point layout, label visibility, auxiliary marks, algebraic values, and ambiguity type\. We use the solver to reject inconsistent constructions and keep only examples whose predicates are executable and whose target answer is uniquely determined\. This gives the formalizer additional supervision for cases that are underrepresented in real data, such as omitted labels, diagram\-supplied points, dense angle marks, and visually similar point names\.

The final training data are tuples\(I,T,Λ∗,a∗,m\)\(I,T,\\Lambda^\{\\ast\},a^\{\\ast\},m\), whereIIdenotes the diagram,TTdenotes the problem text,Λ∗\\Lambda^\{\\ast\}denotes the target formal representation,a∗a^\{\\ast\}denotes the gold answer, andmmstores metadata about entity grounding, ambiguity type, annotation source, and solver status\. This metadata supports curriculum sampling\. Early training emphasizes clean, solver\-verified examples with direct text\-diagram alignment, while later training gradually increases the proportion of diagram\-dependent, ambiguity\-heavy, and cross\-modal\-mismatch examples\. We keep training, validation, and test problems separated at the problem and diagram levels to reduce leakage between curated or template\-derived instances and held\-out evaluation examples\.

### 3\.3Solver\-Driven Autoformalization

Solver\-Driven Autoformalization is the front\-end loop that converts raw multimodal inputs into a formal representation that the solver can execute\. Rather than first extracting OCR tokens, diagram primitives, and textual predicates through separate modules, the formalizerFθF\_\{\\theta\}receives the raw diagramIIand problem textTTjointly and emits a complete predicate sequenceΛ^\\hat\{\\Lambda\}in one pass\. A lightweight parser checks arity, type compatibility, duplicated declarations, illegal symbols, and unresolved references before any candidate is passed to the symbolic solver𝒮\\mathcal\{S\}\. In our implementation,FθF\_\{\\theta\}is initialized from QwenVL3\-2B\(Baiet al\.,[2025](https://arxiv.org/html/2606.27926#bib.bib48)\); all supervised, RL, repair, and theorem\-proposal variants use this same 2B\-parameter multimodal backbone unless explicitly stated otherwise\. The same parser and solver are then used in both training and inference, so the model is optimized toward the interface actually consumed by the downstream reasoner\.

#### Stage I: supervised formal\-language adaptation\.

Given the curated and synthetic formalization dataset

𝒟SFT=\{\(Ii,Ti,Λi∗,ai∗\)\}i=1N,\\mathcal\{D\}\_\{\\mathrm\{SFT\}\}=\\left\\\{\(I\_\{i\},T\_\{i\},\\Lambda\_\{i\}^\{\\ast\},a\_\{i\}^\{\\ast\}\)\\right\\\}\_\{i=1\}^\{N\},\(2\)we first adapt QwenVL3\-2B, a 2B\-parameter pretrained multimodal language model, to the target geometry formal language\. The model generates

Λ^i=Fθ\(Ii,Ti\),\\hat\{\\Lambda\}\_\{i\}=F\_\{\\theta\}\(I\_\{i\},T\_\{i\}\),\(3\)and is trained with the autoregressive negative log\-likelihood of the target predicate sequenceΛi∗=\(yi,1,…,yi,Li\)\\Lambda\_\{i\}^\{\\ast\}=\(y\_\{i,1\},\\ldots,y\_\{i,L\_\{i\}\}\):

ℒSFT\(θ\)=−1N∑i=1N∑t=1Lilog⁡pθ\(yi,t∣Ii,Ti,yi,<t\)\.\\mathcal\{L\}\_\{\\mathrm\{SFT\}\}\(\\theta\)=\-\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\sum\_\{t=1\}^\{L\_\{i\}\}\\log p\_\{\\theta\}\\left\(y\_\{i,t\}\\mid I\_\{i\},T\_\{i\},y\_\{i,<t\}\\right\)\.\(4\)This stage teaches the predicate inventory, output grammar, and cross\-modal grounding patterns, providing a solver\-compatible initialization before execution\-based optimization\.

#### Stage II: solvability\-guided policy optimization\.

After supervised adaptation, we further optimizeFθF\_\{\\theta\}with Solvability\-Guided Reinforcement Learning \(SG\-RL\), instantiated with GSPO\-style group sequence policy optimization\. For each problem, the model samples a group ofKKcandidate formalizations\{Λ^i,j\}j=1K\\\{\\hat\{\\Lambda\}\_\{i,j\}\\\}\_\{j=1\}^\{K\}\. Each candidate is parsed and, if valid, executed by𝒮\\mathcal\{S\}\. We assign a sequence\-level reward

R\(Λ^\)=λfmtRfmt\(Λ^\)\+λexecRexec\(Λ^\)\+λansRans\(Λ^,a∗\),R\(\\hat\{\\Lambda\}\)=\\lambda\_\{\\mathrm\{fmt\}\}R\_\{\\mathrm\{fmt\}\}\(\\hat\{\\Lambda\}\)\+\\lambda\_\{\\mathrm\{exec\}\}R\_\{\\mathrm\{exec\}\}\(\\hat\{\\Lambda\}\)\+\\lambda\_\{\\mathrm\{ans\}\}R\_\{\\mathrm\{ans\}\}\(\\hat\{\\Lambda\},a^\{\\ast\}\),\(5\)where

Rfmt\(Λ^\)=𝕀\[Parse\(Λ^\)=1\],R\_\{\\mathrm\{fmt\}\}\(\\hat\{\\Lambda\}\)=\\mathbb\{I\}\[\\mathrm\{Parse\}\(\\hat\{\\Lambda\}\)=1\],\(6\)Rexec\(Λ^\)=𝕀\[𝒮constructs a non\-contradictory proof state fromΛ^\],R\_\{\\mathrm\{exec\}\}\(\\hat\{\\Lambda\}\)=\\mathbb\{I\}\[\\mathcal\{S\}\\ \\text\{constructs a non\-contradictory proof state from \}\\hat\{\\Lambda\}\],\(7\)and

Rans\(Λ^,a∗\)=𝕀\[𝒮\(Λ^\)=a∗\]\.R\_\{\\mathrm\{ans\}\}\(\\hat\{\\Lambda\},a^\{\\ast\}\)=\\mathbb\{I\}\[\\mathcal\{S\}\(\\hat\{\\Lambda\}\)=a^\{\\ast\}\]\.\(8\)The format term rewards syntactic validity, the execution term rewards usable intermediate solver states, and the answer term rewards candidates that are sufficient for deriving the gold answer\. Invalid or unparsable candidates receive no execution or answer reward\. Within each sampled group, rewards are normalized into sequence\-level advantages and used to update the policy, so formalizations that are merely plausible but unusable by the solver are discouraged\. Unlike the supervised stage, SG\-RL does not require gold formal\-language annotations for every problem; it requires the problem input, the target answer, and solver feedback\.

#### Inference\-time use\.

At test time, the same parser\-solver interface provides structured diagnostics\. If a candidate fails parsing or execution, it is sent to the bounded repair loop described in Appendix[A\.2](https://arxiv.org/html/2606.27926#A1.SS2)\. If the candidate is executable but the solver cannot close the proof within the search budget, the system invokes the theorem\-proposing agent in Section[3\.4](https://arxiv.org/html/2606.27926#S3.SS4)\.

### 3\.4Solver\-Driven Theorem\-Proposing Agent

Even a correct and executable formalization may be insufficient when the symbolic solver is restricted to a fixed theorem library or a finite search budget\. The Solver\-Driven Theorem\-Proposing Agent addresses this bottleneck by acting as a proposal module over the current proof state, not as an unchecked source of new axioms\. In our implementation, the agent is driven by the same QwenVL3\-2B backbone as the formalizer, but its output is restricted to a verifier\-readable proposal schema\. It is activated only after𝒮\\mathcal\{S\}has successfully parsedΛ^\\hat\{\\Lambda\}and built a consistent proof state but cannot derive the target queryqq\. At iterationtt, the solver exposes a state summaryhth\_\{t\}containing known facts, unresolved subgoals, failed theorem matches, and a compact diagnostic messagedtd\_\{t\}\. Conditioned on\(Λ^,q,ht,dt\)\(\\hat\{\\Lambda\},q,h\_\{t\},d\_\{t\}\), the agent proposes a bounded set of candidate auxiliary lemmas

𝒯t=Aϕ\(Λ^,q,ht,dt\),\\mathcal\{T\}\_\{t\}=A\_\{\\phi\}\(\\hat\{\\Lambda\},q,h\_\{t\},d\_\{t\}\),\(9\)where each candidate is represented as a verifier\-readable tuple containing a lemma type, required premises, proposed conclusion, involved entities, and a short rationale\.

The solver filters these candidates before any of them can affect the derivation\. A proposal is discarded if it contains unknown entities, violates type constraints, contradicts existing facts, cannot be instantiated against the current proof state, cannot be verified from admissible premises, or fails to reduce any unresolved subgoal\. For an accepted proposal, the verifier returns a local proof certificate or a deterministic construction trace, and the derived fact is scoped only to the current problem instance\. It is removed after the example is solved or declared unresolved\. This scoping is important: the agent may suggest a useful lemma instantiation or auxiliary relation, but the symbolic verifier remains responsible for deciding whether the suggestion is admissible\. In practice, the agent mostly helps in three cases: selecting a theorem instance that the default search failed to prioritize, exposing a latent relation such as cyclicity or similarity, and proposing a small auxiliary construction that makes an existing theorem applicable\.

Algorithm 1Solver\-driven theorem proposal and verification1:Formal predicates

Λ^\\hat\{\\Lambda\}, target query

qq, solver

𝒮\\mathcal\{S\}, theorem agent

AϕA\_\{\\phi\}, maximum proposal rounds

TmaxT\_\{\\max\}, proposal budget

KK
2:Solved answer or unresolved status

3:

\(s0,h0\)←𝒮\.Initialize\(Λ^,q\)\(s\_\{0\},h\_\{0\}\)\\leftarrow\\mathcal\{S\}\.\\textsc\{Initialize\}\(\\hat\{\\Lambda\},q\)
4:if

s0s\_\{0\}is invalid or contradictorythen

5:returnunresolved⊳\\trianglerightHandled by formalization repair, not theorem proposal

6:endif

7:if

𝒮\.Closed\(s0,q\)\\mathcal\{S\}\.\\textsc\{Closed\}\(s\_\{0\},q\)then

8:return

𝒮\.Answer\(s0,q\)\\mathcal\{S\}\.\\textsc\{Answer\}\(s\_\{0\},q\)
9:endif

10:for

t=1t=1to

TmaxT\_\{\\max\}do

11:

st←st−1s\_\{t\}\\leftarrow s\_\{t\-1\}
12:

dt←𝒮\.Diagnose\(st−1,q\)d\_\{t\}\\leftarrow\\mathcal\{S\}\.\\textsc\{Diagnose\}\(s\_\{t\-1\},q\)
13:

𝒯t←Aϕ\(Λ^,q,ht−1,dt;K\)\\mathcal\{T\}\_\{t\}\\leftarrow A\_\{\\phi\}\(\\hat\{\\Lambda\},q,h\_\{t\-1\},d\_\{t\};K\)
14:for all

τ∈𝒯t\\tau\\in\\mathcal\{T\}\_\{t\}do

15:ifnot

WellTyped\(τ,Λ^\)\\textsc\{WellTyped\}\(\\tau,\\hat\{\\Lambda\}\)then

16:discard

τ\\tau
17:elseifnot

𝒮\.Applicable\(τ,st−1\)\\mathcal\{S\}\.\\textsc\{Applicable\}\(\\tau,s\_\{t\-1\}\)then

18:discard

τ\\tau
19:elseifnot

𝒮\.Verify\(τ,st−1\)\\mathcal\{S\}\.\\textsc\{Verify\}\(\\tau,s\_\{t\-1\}\)then

20:discard

τ\\tau
21:elseifnot

𝒮\.GoalReduced\(τ,st−1,q\)\\mathcal\{S\}\.\\textsc\{GoalReduced\}\(\\tau,s\_\{t\-1\},q\)then

22:discard

τ\\tau
23:else

24:

\(st,cτ\)←𝒮\.AddLocalLemma\(st−1,τ\)\(s\_\{t\},c\_\{\\tau\}\)\\leftarrow\\mathcal\{S\}\.\\textsc\{AddLocalLemma\}\(s\_\{t\-1\},\\tau\)
25:

st←𝒮\.Search\(st,q\)s\_\{t\}\\leftarrow\\mathcal\{S\}\.\\textsc\{Search\}\(s\_\{t\},q\)
26:if

𝒮\.Closed\(st,q\)\\mathcal\{S\}\.\\textsc\{Closed\}\(s\_\{t\},q\)then

27:return

𝒮\.Answer\(st,q\)\\mathcal\{S\}\.\\textsc\{Answer\}\(s\_\{t\},q\)
28:endif

29:endif

30:endfor

31:

ht←𝒮\.Summarize\(st,q\)h\_\{t\}\\leftarrow\\mathcal\{S\}\.\\textsc\{Summarize\}\(s\_\{t\},q\)
32:endfor

33:returnunresolved

Algorithm[1](https://arxiv.org/html/2606.27926#alg1)summarizes the loop\. The proposal budget keeps inference bounded, while the verifier checks ensure that neural proposals cannot directly determine the answer\. The stored certificatecτc\_\{\\tau\}is used only inside the current proof attempt, so accepted proposals behave as verified local lemmas rather than permanent changes to the theorem library\. The agent therefore expands the solver’s search behavior without replacing symbolic proof checking\.

## 4Experiments

Table 1:Results on Geometry3K and PGPS9K\.#### Datasets\.

We evaluate on Geometry3K\(Luet al\.,[2021](https://arxiv.org/html/2606.27926#bib.bib1)\)and PGPS9K\(Zhanget al\.,[2023](https://arxiv.org/html/2606.27926#bib.bib3)\), the two benchmarks used to test whether solver\-driven formalization transfers from raw multimodal input to final problem solving\. Geometry3K contains 3,002 geometry problems, split into 2,101 training, 300 validation, and 601 test problems\. Each problem includes a diagram, problem text, and formal\-language annotations\. PGPS9K extends Geometry3K to 9,022 problems paired with 4,000 unique diagrams and covers common plane\-geometry types from middle\-school and high\-school textbooks\.

#### Metrics\.

Following prior GPS evaluations\(Luet al\.,[2021](https://arxiv.org/html/2606.27926#bib.bib1); Zhanget al\.,[2023](https://arxiv.org/html/2606.27926#bib.bib3); Pinget al\.,[2026](https://arxiv.org/html/2606.27926#bib.bib10); Zhaoet al\.,[2025](https://arxiv.org/html/2606.27926#bib.bib9)\), we report final problem\-solving accuracy under two settings\.*Completion*evaluates open\-ended answer generation, where the system must derive the final value or expression\.*Choice*evaluates multiple\-choice solving\. For ablations on ambiguity resolution, we additionally report final solver\-executed accuracy under three text\-diagram evidence regimes:*text\-complete*, where the text\-mentioned point set equals the diagram point set;*diagram\-augmented*, where the diagram contains additional points required for grounding; and*cross\-modal mismatch*, where the model must resolve weaker or inconsistent entity evidence\.

#### Baselines\.

We compare with three groups of methods that correspond to the failure modes discussed in the introduction\. The first group contains general MLLMs prompted to solve or formalize the problem directly, including G\-LLaVA\(Gaoet al\.,[2025](https://arxiv.org/html/2606.27926#bib.bib12)\), Vision\-R1\(Huanget al\.,[2025](https://arxiv.org/html/2606.27926#bib.bib13)\), Qwen2\.5\-VL\(Qwenet al\.,[2025](https://arxiv.org/html/2606.27926#bib.bib20)\), QwenVL3\-2B\(Baiet al\.,[2025](https://arxiv.org/html/2606.27926#bib.bib48)\), InternVL\(Chenet al\.,[2024](https://arxiv.org/html/2606.27926#bib.bib18); Zhuet al\.,[2025](https://arxiv.org/html/2606.27926#bib.bib19)\), and GPT\-4o\(OpenAI,[2024](https://arxiv.org/html/2606.27926#bib.bib11)\)\. The second group contains neural GPS systems such as NGS\(Chenet al\.,[2022](https://arxiv.org/html/2606.27926#bib.bib35)\), Geoformer\(Khomiakovet al\.,[2024](https://arxiv.org/html/2606.27926#bib.bib32)\), SCA\-GPS\(Ninget al\.,[2023](https://arxiv.org/html/2606.27926#bib.bib31)\), GOLD\(Zhang and Moshfeghi,[2024](https://arxiv.org/html/2606.27926#bib.bib33)\), PGPSNet\(Zhanget al\.,[2023](https://arxiv.org/html/2606.27926#bib.bib3)\), PGPSNet\-v2\(Zhanget al\.,[2024a](https://arxiv.org/html/2606.27926#bib.bib34)\), and LANS\(Liet al\.,[2024](https://arxiv.org/html/2606.27926#bib.bib30)\)\. The third group contains symbolic or neuro\-symbolic solvers such as Inter\-GPS\(Luet al\.,[2021](https://arxiv.org/html/2606.27926#bib.bib1)\), E\-GPS\(Wuet al\.,[2024](https://arxiv.org/html/2606.27926#bib.bib2)\), GeoDRL\(Penget al\.,[2023](https://arxiv.org/html/2606.27926#bib.bib5)\), Auto\-GPS\(Pinget al\.,[2026](https://arxiv.org/html/2606.27926#bib.bib10)\), and Pi\-GPS\(Zhaoet al\.,[2025](https://arxiv.org/html/2606.27926#bib.bib9)\)\. Where available, we separately report OCR\-based variants\(Pinget al\.,[2026](https://arxiv.org/html/2606.27926#bib.bib10); Zhaoet al\.,[2025](https://arxiv.org/html/2606.27926#bib.bib9)\), because they expose the brittleness of cascaded perception\-to\-logic pipelines under fully automatic input processing\.

### 4\.1Results

Table[1](https://arxiv.org/html/2606.27926#S4.T1)summarizes the results on Geometry3K and PGPS9K\.SD\-GPSachieves the best accuracy across all four evaluation settings\. On Geometry3K, it obtains 86\.4% completion accuracy and 90\.4% choice accuracy, improving over the strongest listed prior results by 3\.5 and 3\.2 percentage points, respectively\. On PGPS9K,SD\-GPSreaches 79\.8% completion accuracy and 84\.5% choice accuracy, surpassing the strongest listed prior results by 4\.4 and 3\.0 percentage points\. These gains indicate that solver\-driven formalization improves not only predicate quality but also final reasoning performance\.

The comparison also highlights the limitations of direct MLLM prompting\. Although recent MLLMs perform well on general multimodal tasks, their accuracy remains limited on formal geometry reasoning, especially in completion settings where exact executable reasoning is required\. The prompted QwenVL3\-2B baseline is therefore reported separately fromSD\-GPS:SD\-GPSuses the same backbone but adds solver\-driven formal\-language adaptation, SG\-RL, bounded repair, and verified theorem proposal\. In contrast to direct prompting,SD\-GPSexplicitly converts multimodal inputs into solver\-ready predicates and uses execution feedback to penalize incomplete or unusable formalizations\. The OCR\-based variants of Auto\-GPS and Pi\-GPS further show that cascaded systems are sensitive to upstream recognition errors, reinforcing the need for integrated cross\-modal formalization\.

### 4\.2Ablation Study

#### Robustness under cross\-modal ambiguity\.

Table[2](https://arxiv.org/html/2606.27926#S4.T2)reports final solver\-executed accuracy under different relations between text\-mentioned points and diagram points\. The SFT\-only formalizer performs strongly when the textual point set is complete, but its accuracy decreases in the diagram\-augmented and cross\-modal\-mismatch regimes\. This suggests that cross\-modal ambiguity is not merely a post\-processing issue: missing, redundant, or inconsistently grounded entities can directly change the executable predicates provided to the symbolic solver\.

#### Effect of SG\-RL and fallback repair\.

Introducing SG\-RL improves the overall accuracy from 74\.9% to 79\.5%, indicating that solver\-derived feedback provides a useful supervisory signal beyond predicate\-level imitation\. The improvement is particularly pronounced in the cross\-modal\-mismatch setting, where accuracy increases from 67\.2% to 74\.8%\. This result suggests that execution feedback helps the model prefer formalizations that are not only syntactically valid but also more compatible with downstream symbolic reasoning\. Adding fallback repair further raises the overall accuracy to 84\.5%, with a notable gain in the diagram\-augmented setting from 77\.1% to 84\.3%\. These results show that bounded execution\-aware regeneration can recover from invalid, incomplete, or weakly grounded formalizations\.

#### Effect of theorem proposal\.

The final row of Table[2](https://arxiv.org/html/2606.27926#S4.T2)incorporates the solver\-driven theorem\-proposing agent, increasing the overall accuracy to 86\.4%\. After formalization errors are reduced, some remaining failures are caused by the limited coverage of the fixed theorem library rather than by incorrect parsing alone\. By observing the solver state and proposing verifiable auxiliary lemmas, the theorem\-proposing agent extends the effective reasoning coverage of the symbolic back\-end while preserving verification constraints\. Importantly, the proposed lemmas are not accepted as unconstrained neural guesses; they must be checked by the solver before contributing to the final derivation\.

#### Effect of task\-specific formalization training\.

Table[3](https://arxiv.org/html/2606.27926#S4.T3)compares prompt\-based formal\-language generation by general\-purpose MLLMs with formalizers adapted to the target geometry language\. The results show that task\-specific adaptation produces more reliable executable formalizations, especially under the syntax, grounding, and predicate constraints imposed by the downstream solver\. This supports the need for dedicated formalization training rather than relying solely on prompt\-based generation from general\-purpose multimodal models\.

Table 2:Fine\-grained end\-to\-end accuracy under different text\-diagram point relations\.Text\-completemeans the point set mentioned in the problem text equals the diagram point set\.Diagram\-augmentedmeans the text\-mentioned point set is a strict subset of the diagram point set\.Cross\-modal mismatchcovers the remaining cases\.Table 3:End\-to\-end solver execution accuracy using model\-generated formal representations\. All models generate formal representations that are executed by the same symbolic solver\. General\-purpose MLLMs are evaluated with the same prompt\. TheSD\-GPSrows use QwenVL3\-2B as the backbone; the QwenVL3\-2B row denotes the prompted base model without solver\-driven adaptation\.

## 5Limitations

Our evaluation is limited to plane\-geometry problems that can be represented with an Inter\-GPS\-style formal language\. Extending the framework to three\-dimensional geometry, construction\-based problems, or free\-form proof generation may require a richer predicate vocabulary and additional symbolic verification rules\.

The method also depends on the coverage of the symbolic solver\. Solver feedback is useful for training and verification, but solver failure may arise from missing theorem rules or limited search depth rather than incorrect formalization\. Similarly, the theorem\-proposing agent is constrained by the verifier: its proposed auxiliary relations are accepted only when they pass symbolic checks and help reduce the unresolved goal\. The agent should therefore be interpreted as a bounded proposal mechanism for theorem instances and auxiliary relations, not as a replacement for formal proof certification\.

## 6Conclusion

We presentedSD\-GPS, a solver\-driven framework for verifiable geometry problem solving\. Instead of treating autoformalization as a static translation task and theorem prediction as a closed\-library selection problem,SD\-GPSuses symbolic execution as an active supervisory signal\. A unified multimodal formalizer produces solver\-ready predicates from raw diagram\-text inputs, solver\-verified reinforcement learning aligns training with downstream executability, and an iterative theorem\-proposing agent expands the solver’s deductive reach through verified lemma proposals\. Experiments on Geometry3K and PGPS9K show consistent improvements over MLLM, neural, and neuro\-symbolic baselines\. These results suggest that closing the loop between perception and symbolic verification is a promising direction for robust multimodal mathematical reasoning\. Future work will extend the framework to broader formal languages, stronger proof\-certificate generation, and more diverse geometric domains beyond plane\-geometry benchmarks\.

## References

- S\. Bai, Y\. Cai, R\. Chen, K\. Chen, X\. Chen, Z\. Cheng, L\. Deng, W\. Ding, C\. Gao, C\. Ge, W\. Ge, Z\. Guo, Q\. Huang, J\. Huang, F\. Huang, B\. Hui, S\. Jiang, Z\. Li, M\. Li, M\. Li, K\. Li, Z\. Lin, J\. Lin, X\. Liu, J\. Liu, C\. Liu, Y\. Liu, D\. Liu, S\. Liu, D\. Lu, R\. Luo, C\. Lv, R\. Men, L\. Meng, X\. Ren, X\. Ren, S\. Song, Y\. Sun, J\. Tang, J\. Tu, J\. Wan, P\. Wang, P\. Wang, Q\. Wang, Y\. Wang, T\. Xie, Y\. Xu, H\. Xu, J\. Xu, Z\. Yang, M\. Yang, J\. Yang, A\. Yang, B\. Yu, F\. Zhang, H\. Zhang, X\. Zhang, B\. Zheng, H\. Zhong, J\. Zhou, F\. Zhou, J\. Zhou, Y\. Zhu, and K\. Zhu \(2025\)Qwen3\-vl technical report\.External Links:2511\.21631,[Link](https://arxiv.org/abs/2511.21631)Cited by:[§3\.3](https://arxiv.org/html/2606.27926#S3.SS3.p1.6),[§4](https://arxiv.org/html/2606.27926#S4.SS0.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.27926#S4.T1.3.12.12.1)\.
- B\. Buchberger \(1976\)A theoretical basis for the reduction of polynomials to canonical forms\.10\(3\),pp\. 19–29\.External Links:ISSN 0163\-5824,[Link](https://doi.org/10.1145/1088216.1088219),[Document](https://dx.doi.org/10.1145/1088216.1088219)Cited by:[§2](https://arxiv.org/html/2606.27926#S2.p1.1)\.
- J\. Chen, J\. Tang, J\. Qin, X\. Liang, L\. Liu, E\. P\. Xing, and L\. Lin \(2022\)GeoQA: a geometric question answering benchmark towards multimodal numerical reasoning\.External Links:2105\.14517,[Link](https://arxiv.org/abs/2105.14517)Cited by:[§1](https://arxiv.org/html/2606.27926#S1.p1.1),[§4](https://arxiv.org/html/2606.27926#S4.SS0.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.27926#S4.T1.3.14.14.1)\.
- Z\. Chen, J\. Wu, W\. Wang, W\. Su, G\. Chen, S\. Xing, M\. Zhong, Q\. Zhang, X\. Zhu, L\. Lu,et al\.\(2024\)Internvl: scaling up vision foundation models and aligning for generic visual\-linguistic tasks\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 24185–24198\.Cited by:[§4](https://arxiv.org/html/2606.27926#S4.SS0.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.27926#S4.T1.3.7.7.1)\.
- J\. Cheng, Z\. Zhang, R\. Chen, J\. Deng, Z\. Qin, and J\. Ma \(2025\)GeoUni: a unified model for generating geometry diagrams, problems and problem solutions\.External Links:2504\.10146,[Link](https://arxiv.org/abs/2504.10146)Cited by:[§2](https://arxiv.org/html/2606.27926#S2.p1.1)\.
- Y\. Chervonyi, T\. H\. Trinh, M\. Olšák, X\. Yang, H\. Nguyen, M\. Menegali, J\. Jung, J\. Kim, V\. Verma, Q\. V\. Le, and T\. Luong \(2025\)Gold\-medalist performance in solving olympiad geometry with alphageometry2\.External Links:2502\.03544,[Link](https://arxiv.org/abs/2502.03544)Cited by:[§1](https://arxiv.org/html/2606.27926#S1.p1.1)\.
- DeepSeek\-AI \(2025\)DeepSeek\-r1: incentivizing reasoning capability in llms via reinforcement learning\.External Links:2501\.12948,[Link](https://arxiv.org/abs/2501.12948)Cited by:[§2](https://arxiv.org/html/2606.27926#S2.p3.1)\.
- J\. Gao, R\. Pi, J\. Zhang, J\. Ye, W\. Zhong, Y\. Wang, L\. HONG, J\. Han, H\. Xu, Z\. Li, and L\. Kong \(2025\)G\-LLaVA: solving geometric problem with multi\-modal large language model\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=px1674Wp3C)Cited by:[§4](https://arxiv.org/html/2606.27926#S4.SS0.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.27926#S4.T1.3.4.4.1)\.
- H\. Gelernter, J\. R\. Hansen, and D\. W\. Loveland \(1960\)Empirical explorations of the geometry theorem machine\.InPapers Presented at the May 3\-5, 1960, Western Joint IRE\-AIEE\-ACM Computer Conference,IRE\-AIEE\-ACM ’60 \(Western\),New York, NY, USA,pp\. 143–149\.External Links:ISBN 9781450378697,[Link](https://doi.org/10.1145/1460361.1460381),[Document](https://dx.doi.org/10.1145/1460361.1460381)Cited by:[§1](https://arxiv.org/html/2606.27926#S1.p1.1)\.
- Y\. Hao, M\. Zhang, F\. Yin, and L\. Huang \(2022\)PGDP5K: a diagram parsing dataset for plane geometry problems\.In2022 26th international conference on pattern recognition \(ICPR\),pp\. 1763–1769\.Cited by:[§2](https://arxiv.org/html/2606.27926#S2.p2.1)\.
- Z\. He, Z\. Lyu, D\. Chen, D\. Guo, and Y\. R\. Fung \(2025\)MATP\-bench: can mllm be a good automated theorem prover for multimodal problems?\.External Links:2506\.06034,[Link](https://arxiv.org/abs/2506.06034)Cited by:[§2](https://arxiv.org/html/2606.27926#S2.p2.1)\.
- W\. Huang, B\. Jia, Z\. Zhai, S\. Cao, Z\. Ye, F\. Zhao, Z\. Xu, Y\. Hu, and S\. Lin \(2025\)Vision\-r1: incentivizing reasoning capability in multimodal large language models\.External Links:2503\.06749,[Link](https://arxiv.org/abs/2503.06749)Cited by:[§4](https://arxiv.org/html/2606.27926#S4.SS0.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.27926#S4.T1.3.5.5.1)\.
- M\. Khomiakov, M\. R\. Andersen, and J\. Frellsen \(2024\)GeoFormer: a multi\-polygon segmentation transformer\.External Links:2411\.16616,[Link](https://arxiv.org/abs/2411.16616)Cited by:[§4](https://arxiv.org/html/2606.27926#S4.SS0.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.27926#S4.T1.3.15.15.1)\.
- Z\. Li, M\. Zhang, F\. Yin, and C\. Liu \(2024\)LANS: a layout\-aware neural solver for plane geometry problem\.InFindings of the Association for Computational Linguistics: ACL 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 2596–2608\.External Links:[Link](https://aclanthology.org/2024.findings-acl.153/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.153)Cited by:[§4](https://arxiv.org/html/2606.27926#S4.SS0.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.27926#S4.T1.3.19.19.1)\.
- P\. Lu, R\. Gong, S\. Jiang, L\. Qiu, S\. Huang, X\. Liang, and S\. Zhu \(2021\)Inter\-gps: interpretable geometry problem solving with formal language and symbolic reasoning\.InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\),pp\. 6774–6786\.Cited by:[§A\.1](https://arxiv.org/html/2606.27926#A1.SS1.p1.1),[§1](https://arxiv.org/html/2606.27926#S1.p2.1),[§1](https://arxiv.org/html/2606.27926#S1.p5.1),[§2](https://arxiv.org/html/2606.27926#S2.p1.1),[§4](https://arxiv.org/html/2606.27926#S4.SS0.SSS0.Px1.p1.1),[§4](https://arxiv.org/html/2606.27926#S4.SS0.SSS0.Px2.p1.1),[§4](https://arxiv.org/html/2606.27926#S4.SS0.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.27926#S4.T1.3.22.22.1)\.
- M\. Ning, Q\. Wang, K\. Huang, and X\. Huang \(2023\)A symbolic character\-aware model for solving geometry problems\.External Links:2308\.02823,[Link](https://arxiv.org/abs/2308.02823)Cited by:[§4](https://arxiv.org/html/2606.27926#S4.SS0.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.27926#S4.T1.3.16.16.1)\.
- OpenAI \(2024\)GPT\-4o system card\.External Links:2410\.21276,[Link](https://arxiv.org/abs/2410.21276)Cited by:[§4](https://arxiv.org/html/2606.27926#S4.SS0.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.27926#S4.T1.3.9.9.1)\.
- S\. Peng, D\. Fu, Y\. Liang, L\. Gao, and Z\. Tang \(2023\)Geodrl: a self\-learning framework for geometry problem solving using reinforcement learning in deductive reasoning\.InFindings of the Association for Computational Linguistics: ACL 2023,pp\. 13468–13480\.Cited by:[§A\.1](https://arxiv.org/html/2606.27926#A1.SS1.p1.1),[§1](https://arxiv.org/html/2606.27926#S1.p1.1),[§2](https://arxiv.org/html/2606.27926#S2.p1.1),[§2](https://arxiv.org/html/2606.27926#S2.p3.1),[§4](https://arxiv.org/html/2606.27926#S4.SS0.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.27926#S4.T1.3.24.24.1)\.
- B\. Ping, M\. Luo, Z\. Dang, C\. Wang, and C\. Jia \(2026\)AutoGPS: automated geometry problem solving via multimodal formalization and deductive reasoning\.External Links:2505\.23381,[Link](https://arxiv.org/abs/2505.23381)Cited by:[§A\.1](https://arxiv.org/html/2606.27926#A1.SS1.p1.1),[§1](https://arxiv.org/html/2606.27926#S1.p2.1),[§1](https://arxiv.org/html/2606.27926#S1.p3.1),[§2](https://arxiv.org/html/2606.27926#S2.p1.1),[§2](https://arxiv.org/html/2606.27926#S2.p2.1),[§4](https://arxiv.org/html/2606.27926#S4.SS0.SSS0.Px2.p1.1),[§4](https://arxiv.org/html/2606.27926#S4.SS0.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.27926#S4.T1.3.25.25.1),[Table 1](https://arxiv.org/html/2606.27926#S4.T1.3.26.26.1)\.
- Qwen, :, A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Tang, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2025\)Qwen2\.5 technical report\.External Links:2412\.15115,[Link](https://arxiv.org/abs/2412.15115)Cited by:[§4](https://arxiv.org/html/2606.27926#S4.SS0.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.27926#S4.T1.3.6.6.1)\.
- M\. Sachan and E\. Xing \(2017\)Learning to solve geometry problems from natural language demonstrations in textbooks\.InProceedings of the 6th Joint Conference on Lexical and Computational Semantics \(\*SEM 2017\),N\. Ide, A\. Herbelot, and L\. Màrquez \(Eds\.\),Vancouver, Canada,pp\. 251–261\.External Links:[Link](https://aclanthology.org/S17-1029/),[Document](https://dx.doi.org/10.18653/v1/S17-1029)Cited by:[§1](https://arxiv.org/html/2606.27926#S1.p1.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. K\. Li, Y\. Wu, and D\. Guo \(2024\)DeepSeekMath: pushing the limits of mathematical reasoning in open language models\.External Links:2402\.03300,[Link](https://arxiv.org/abs/2402.03300)Cited by:[§2](https://arxiv.org/html/2606.27926#S2.p3.1)\.
- K\. Team, T\. Bai, Y\. Bai, Y\. Bao, S\. H\. Cai, Y\. Cao, Y\. Charles, H\. S\. Che, C\. Chen, G\. Chen, H\. Chen, J\. Chen, J\. Chen, J\. Chen, J\. Chen, K\. Chen, L\. Chen, R\. Chen, X\. Chen, Y\. Chen, Y\. Chen, Y\. Chen, Y\. Chen, Y\. Chen, Y\. Chen, Y\. Chen, Y\. Chen, Z\. Chen, Z\. Chen, D\. Cheng, M\. Chu, J\. Cui, J\. Deng, M\. Diao, H\. Ding, M\. Dong, M\. Dong, Y\. Dong, Y\. Dong, A\. Du, C\. Du, D\. Du, L\. Du, Y\. Du, Y\. Fan, S\. Fang, Q\. Feng, Y\. Feng, G\. Fu, K\. Fu, H\. Gao, T\. Gao, Y\. Ge, S\. Geng, C\. Gong, X\. Gong, Z\. Gongque, Q\. Gu, X\. Gu, Y\. Gu, L\. Guan, Y\. Guo, X\. Hao, W\. He, W\. He, Y\. He, C\. Hong, H\. Hu, J\. Hu, Y\. Hu, Z\. Hu, K\. Huang, R\. Huang, W\. Huang, Z\. Huang, T\. Jiang, Z\. Jiang, X\. Jin, Y\. Jing, G\. Lai, A\. Li, C\. Li, C\. Li, F\. Li, G\. Li, G\. Li, H\. Li, H\. Li, J\. Li, J\. Li, J\. Li, L\. Li, M\. Li, W\. Li, W\. Li, X\. Li, X\. Li, Y\. Li, Y\. Li, Y\. Li, Y\. Li, Z\. Li, Z\. Li, W\. Liao, J\. Lin, X\. Lin, Z\. Lin, Z\. Lin, C\. Liu, C\. Liu, H\. Liu, L\. Liu, S\. Liu, S\. Liu, S\. Liu, T\. Liu, T\. Liu, W\. Liu, X\. Liu, Y\. Liu, Y\. Liu, Y\. Liu, Y\. Liu, Y\. Liu, Z\. Liu, Z\. Liu, E\. Lu, H\. Lu, Z\. Lu, J\. Luo, T\. Luo, Y\. Luo, L\. Ma, Y\. Ma, S\. Mao, Y\. Mei, X\. Men, F\. Meng, Z\. Meng, Y\. Miao, M\. Ni, K\. Ouyang, S\. Pan, B\. Pang, Y\. Qian, R\. Qin, Z\. Qin, J\. Qiu, B\. Qu, Z\. Shang, Y\. Shao, T\. Shen, Z\. Shen, J\. Shi, L\. Shi, S\. Shi, F\. Song, P\. Song, T\. Song, X\. Song, H\. Su, J\. Su, Z\. Su, L\. Sui, J\. Sun, J\. Sun, T\. Sun, F\. Sung, Y\. Tai, C\. Tang, H\. Tang, X\. Tang, Z\. Tang, J\. Tao, S\. Teng, C\. Tian, P\. Tian, A\. Wang, B\. Wang, C\. Wang, C\. Wang, C\. Wang, D\. Wang, D\. Wang, D\. Wang, F\. Wang, H\. Wang, H\. Wang, H\. Wang, H\. Wang, H\. Wang, J\. Wang, J\. Wang, J\. Wang, K\. Wang, L\. Wang, Q\. Wang, S\. Wang, S\. Wang, S\. Wang, W\. Wang, X\. Wang, X\. Wang, Y\. Wang, Y\. Wang, Y\. Wang, Y\. Wang, Y\. Wang, Y\. Wang, Z\. Wang, Z\. Wang, Z\. Wang, Z\. Wang, Z\. Wang, Z\. Wang, C\. Wei, M\. Wei, C\. Wen, Z\. Wen, C\. Wu, H\. Wu, J\. Wu, R\. Wu, W\. Wu, Y\. Wu, Y\. Wu, Y\. Wu, Z\. Wu, C\. Xiao, J\. Xie, X\. Xie, Y\. Xie, Y\. Xin, B\. Xing, B\. Xu, J\. Xu, J\. Xu, J\. Xu, L\. H\. Xu, L\. Xu, S\. Xu, W\. Xu, X\. Xu, X\. Xu, Y\. Xu, Y\. Xu, Y\. Xu, Z\. Xu, Z\. Xu, J\. Yan, Y\. Yan, G\. Yang, H\. Yang, J\. Yang, K\. Yang, N\. Yang, R\. Yang, X\. Yang, X\. Yang, Y\. Yang, Y\. Yang, Y\. Yang, Z\. Yang, Z\. Yang, Z\. Yang, H\. Yao, D\. Ye, W\. Ye, Z\. Ye, B\. Yin, C\. Yu, L\. Yu, T\. Yu, T\. Yu, E\. Yuan, M\. Yuan, X\. Yuan, Y\. Yue, W\. Zeng, D\. Zha, H\. Zhan, D\. Zhang, H\. Zhang, J\. Zhang, P\. Zhang, Q\. Zhang, R\. Zhang, X\. Zhang, Y\. Zhang, Y\. Zhang, Y\. Zhang, Y\. Zhang, Y\. Zhang, Y\. Zhang, Y\. Zhang, Y\. Zhang, Y\. Zhang, Y\. Zhang, Z\. Zhang, C\. Zhao, F\. Zhao, J\. Zhao, S\. Zhao, X\. Zhao, Y\. Zhao, Z\. Zhao, H\. Zheng, R\. Zheng, S\. Zheng, T\. Zheng, J\. Zhong, L\. Zhong, W\. Zhong, M\. Zhou, R\. Zhou, X\. Zhou, Z\. Zhou, J\. Zhu, L\. Zhu, X\. Zhu, Y\. Zhu, Z\. Zhu, J\. Zhuang, W\. Zhuang, Y\. Zou, and X\. Zu \(2026a\)Kimi k2\.5: visual agentic intelligence\.External Links:2602\.02276,[Link](https://arxiv.org/abs/2602.02276)Cited by:[Table 1](https://arxiv.org/html/2606.27926#S4.T1.3.11.11.1)\.
- V\. Team, W\. Hong, W\. Yu, X\. Gu, G\. Wang, G\. Gan, H\. Tang, J\. Cheng, J\. Qi, J\. Ji, L\. Pan, S\. Duan, W\. Wang, Y\. Wang, Y\. Cheng, Z\. He, Z\. Su, Z\. Yang, Z\. Pan, A\. Zeng, B\. Wang, B\. Chen, B\. Shi, C\. Pang, C\. Zhang, D\. Yin, F\. Yang, G\. Chen, H\. Li, J\. Zhu, J\. Chen, J\. Xu, J\. Xu, J\. Chen, J\. Lin, J\. Chen, J\. Wang, J\. Chen, L\. Lei, L\. Gong, L\. Pan, M\. Liu, M\. Xu, M\. Zhang, Q\. Zheng, R\. Lyu, S\. Tu, S\. Yang, S\. Meng, S\. Zhong, S\. Huang, S\. Zhao, S\. Xue, T\. Zhang, T\. Luo, T\. Hao, T\. Tong, W\. Jia, W\. Li, X\. Liu, X\. Zhang, X\. Lyu, X\. Zhang, X\. Fan, X\. Huang, Y\. Xue, Y\. Wang, Y\. Wang, Y\. Wang, Y\. An, Y\. Du, Y\. Huang, Y\. Niu, Y\. Shi, Y\. Wang, Y\. Wang, Y\. Yue, Y\. Li, Y\. Liu, Y\. Zhang, Y\. Wang, Y\. Zhang, Z\. Xue, Z\. Du, Z\. Hou, Z\. Wang, P\. Zhang, D\. Liu, B\. Xu, J\. Li, M\. Huang, Y\. Dong, and J\. Tang \(2026b\)GLM\-4\.5v and glm\-4\.1v\-thinking: towards versatile multimodal reasoning with scalable reinforcement learning\.External Links:2507\.01006,[Link](https://arxiv.org/abs/2507.01006)Cited by:[Table 1](https://arxiv.org/html/2606.27926#S4.T1.3.10.10.1)\.
- T\. Trinh, Y\. Wu, Q\. Le, H\. He, and T\. Luong \(2024\)Solving olympiad geometry without human demonstrations\.Nature\.External Links:[Document](https://dx.doi.org/10.1038/s41586-023-06747-5)Cited by:[§1](https://arxiv.org/html/2606.27926#S1.p1.1)\.
- W\. Wu \(1978\)On the decision problem and the mechanization of theorem\-proving in elementary geometry\.21\(2\),pp\. 159–172\.Cited by:[§2](https://arxiv.org/html/2606.27926#S2.p1.1)\.
- W\. Wu \(1986\)Basic principles of mechanical theorem proving in elementary geometrics\.2\(3\),pp\. 221–252\.External Links:ISSN 0168\-7433,[Link](https://doi.org/10.1007/BF02328447),[Document](https://dx.doi.org/10.1007/BF02328447)Cited by:[§1](https://arxiv.org/html/2606.27926#S1.p1.1)\.
- W\. Wu, L\. Zhang, J\. Liu, X\. Tang, Y\. Wang, S\. Wang, and Q\. Wang \(2024\)E\-gps: explainable geometry problem solving via top\-down solver and bottom\-up generator\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 13828–13837\.Cited by:[§A\.1](https://arxiv.org/html/2606.27926#A1.SS1.p1.1),[§1](https://arxiv.org/html/2606.27926#S1.p1.1),[§2](https://arxiv.org/html/2606.27926#S2.p1.1),[§4](https://arxiv.org/html/2606.27926#S4.SS0.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.27926#S4.T1.3.23.23.1)\.
- Y\. Wu, A\. Q\. Jiang, W\. Li, M\. N\. Rabe, C\. Staats, M\. Jamnik, and C\. Szegedy \(2022a\)Autoformalization with large language models\.External Links:2205\.12615,[Link](https://arxiv.org/abs/2205.12615)Cited by:[§1](https://arxiv.org/html/2606.27926#S1.p1.1)\.
- Y\. Wu, A\. Q\. Jiang, W\. Li, M\. Rabe, C\. Staats, M\. Jamnik, and C\. Szegedy \(2022b\)Autoformalization with large language models\.InAdvances in Neural Information Processing Systems,S\. Koyejo, S\. Mohamed, A\. Agarwal, D\. Belgrave, K\. Cho, and A\. Oh \(Eds\.\),Vol\.35,pp\. 32353–32368\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2022/file/d0c6bc641a56bebee9d985b937307367-Paper-Conference.pdf)Cited by:[§2](https://arxiv.org/html/2606.27926#S2.p2.1)\.
- J\. Xiong, Q\. Han, Y\. Hsieh, H\. Shen, H\. Xin, C\. Tao, C\. Zhao, H\. Zhang, T\. Wu, Z\. Zhang, H\. Wang, Z\. Wan, L\. Kong, and N\. Wong \(2026\)MMFormalizer: multimodal autoformalization in the wild\.External Links:2601\.03017,[Link](https://arxiv.org/abs/2601.03017)Cited by:[§2](https://arxiv.org/html/2606.27926#S2.p2.1)\.
- J\. Zhang and Y\. Moshfeghi \(2024\)GOLD: geometry problem solver with natural language description\.External Links:2405\.00494,[Link](https://arxiv.org/abs/2405.00494)Cited by:[§4](https://arxiv.org/html/2606.27926#S4.SS0.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.27926#S4.T1.3.17.17.1)\.
- M\. Zhang, Z\. Li, F\. Yin, L\. Lin, and C\. Liu \(2024a\)Fuse, reason and verify: geometry problem solving with parsed clauses from diagram\.External Links:2407\.07327,[Link](https://arxiv.org/abs/2407.07327)Cited by:[§4](https://arxiv.org/html/2606.27926#S4.SS0.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.27926#S4.T1.3.18.18.1)\.
- M\. Zhang, F\. Yin, and C\. Liu \(2023\)A multi\-modal neural geometric solver with textual clauses parsed from diagram\.InProceedings of the Thirty\-Second International Joint Conference on Artificial Intelligence,pp\. 3374–3382\.Cited by:[§1](https://arxiv.org/html/2606.27926#S1.p5.1),[§4](https://arxiv.org/html/2606.27926#S4.SS0.SSS0.Px1.p1.1),[§4](https://arxiv.org/html/2606.27926#S4.SS0.SSS0.Px2.p1.1),[§4](https://arxiv.org/html/2606.27926#S4.SS0.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.27926#S4.T1.3.20.20.1)\.
- X\. Zhang, N\. Zhu, Y\. He, J\. Zou, Q\. Huang, X\. Jin, Y\. Guo, C\. Mao, Y\. Li, Z\. Zhu, D\. Yue, F\. Zhu, Y\. Wang, Y\. Huang, R\. Wang, C\. Qin, Z\. Zeng, S\. Xie, X\. Luo, and T\. Leng \(2024b\)FormalGeo: an extensible formalized framework for olympiad geometric problem solving\.External Links:2310\.18021,[Link](https://arxiv.org/abs/2310.18021)Cited by:[§1](https://arxiv.org/html/2606.27926#S1.p3.1)\.
- Z\. Zhang, J\. Cheng, J\. Deng, L\. Tian, J\. Ma, Z\. Qin, X\. Zhang, N\. Zhu, and T\. Leng \(2024c\)Diagram formalization enhanced multi\-modal geometry problem solver\.External Links:2409\.04214,[Link](https://arxiv.org/abs/2409.04214)Cited by:[§2](https://arxiv.org/html/2606.27926#S2.p2.1)\.
- J\. Zhao, T\. Zhang, J\. Sun, M\. Tian, and H\. Huang \(2025\)Pi\-gps: enhancing geometry problem solving by unleashing the power of diagrammatic information\.InProceedings of the IEEE/CVF International Conference on Computer Vision \(ICCV\),pp\. 1526–1536\.Cited by:[§1](https://arxiv.org/html/2606.27926#S1.p2.1),[§2](https://arxiv.org/html/2606.27926#S2.p1.1),[§2](https://arxiv.org/html/2606.27926#S2.p2.1),[§4](https://arxiv.org/html/2606.27926#S4.SS0.SSS0.Px2.p1.1),[§4](https://arxiv.org/html/2606.27926#S4.SS0.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.27926#S4.T1.3.27.27.1),[Table 1](https://arxiv.org/html/2606.27926#S4.T1.3.28.28.1)\.
- C\. Zheng, S\. Liu, M\. Li, X\. Chen, B\. Yu, C\. Gao, K\. Dang, Y\. Liu, R\. Men, A\. Yang, J\. Zhou, and J\. Lin \(2025\)Group sequence policy optimization\.External Links:2507\.18071,[Link](https://arxiv.org/abs/2507.18071)Cited by:[§2](https://arxiv.org/html/2606.27926#S2.p3.1)\.
- J\. Zhu, W\. Wang, Z\. Chen, Z\. Liu, S\. Ye, L\. Gu, H\. Tian, Y\. Duan, W\. Su, J\. Shao, Z\. Gao, E\. Cui, X\. Wang, Y\. Cao, Y\. Liu, X\. Wei, H\. Zhang, H\. Wang, W\. Xu, H\. Li, J\. Wang, N\. Deng, S\. Li, Y\. He, T\. Jiang, J\. Luo, Y\. Wang, C\. He, B\. Shi, X\. Zhang, W\. Shao, J\. He, Y\. Xiong, W\. Qu, P\. Sun, P\. Jiao, H\. Lv, L\. Wu, K\. Zhang, H\. Deng, J\. Ge, K\. Chen, L\. Wang, M\. Dou, L\. Lu, X\. Zhu, T\. Lu, D\. Lin, Y\. Qiao, J\. Dai, and W\. Wang \(2025\)InternVL3: exploring advanced training and test\-time recipes for open\-source multimodal models\.External Links:2504\.10479,[Link](https://arxiv.org/abs/2504.10479)Cited by:[§4](https://arxiv.org/html/2606.27926#S4.SS0.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.27926#S4.T1.3.8.8.1)\.

## Appendix ATechnical appendices and supplementary material

### A\.1Experimental and implementation details

The symbolic method Inter\-GPS\[Luet al\.,[2021](https://arxiv.org/html/2606.27926#bib.bib1)\]and Auto\-GPS\[Pinget al\.,[2026](https://arxiv.org/html/2606.27926#bib.bib10)\]were reproduced using the authors’ open\-source code, while results for GeoDRL\[Penget al\.,[2023](https://arxiv.org/html/2606.27926#bib.bib5)\]and E\-GPS\[Wuet al\.,[2024](https://arxiv.org/html/2606.27926#bib.bib2)\]were extracted from original publications due to code unavailability\. All experiments were executed on an Intel Xeon Gold 6530 CPU platform in conjunction with two NVIDIA L40S GPUs\. The trainable multimodal formalizer is initialized from QwenVL3\-2B; the prompted QwenVL3\-2B rows use the same base model without task\-specific adaptation\. A strict timeout threshold of 600 seconds was imposed on symbolic solvers, where any computation exceeding this duration was systematically categorized as resolution failure\. For MLLM hyperparameters, we setmax\_tokens=4096,temperature=0\.1, andtop\-p=1\.0\.

#### Backbone\.

TheSD\-GPSformalizer is initialized from QwenVL3\-2B, and all reportedSD\-GPSvariants use this 2B multimodal backbone unless otherwise specified\. The theorem\-proposing agent uses Qwen3\.6\-Plus\. The QwenVL3\-2B entries in the baseline tables denote the prompted base model without solver\-driven adaptation\.

#### Input normalization\.

For each example, the diagram is resized while preserving aspect ratio and is paired with the raw problem statement\. We do not expose OCR tokens, detected bounding boxes, or manually curated point sets toSD\-GPSduring inference\. Text is normalized only for Unicode variants of mathematical symbols and whitespace; numerical values, point labels, and algebraic expressions are otherwise kept unchanged to avoid leaking solver\-side corrections into the formalizer\.

#### Formal representation\.

We adopt an InterGPS\-style typed formal language to represent geometric problems in a solver\-executable form\. The formalizer emits a structured sequence of typed predicates, explicitly separating entity declarations, relational constraints, and query targets\. Each problem instance is represented as a directed set of well\-typed logical atoms over geometric primitives\. This InterGPS\-style representation enforces strict type consistency between geometric objects and operators, preventing invalid compositions such as applying length functions to angles or mixing segment\-level and point\-level relations\. Before being passed to the solver, a lightweight structural parser performs validation of: arity constraints, type compatibility, duplicate entity declarations, illegal symbol usage, and unresolved references\.

### A\.2Inference\-time verification and repair

At inference time,SD\-GPSuses a bounded verify\-repair loop\. The first pass generates a candidate formalization\. The parser and solver then return structured diagnostics, including missing entity references, ill\-typed predicates, contradictory constraints, unsolved targets, and timeout states\. If the candidate fails before a valid proof state is constructed, the model is prompted to repair only the formal predicates while preserving the original problem image and text\. We allow at most two repair attempts per example\. The final answer is accepted only when it is produced from a solver\-executable formalization; otherwise the example is marked unresolved\. This policy keeps the reported accuracy tied to verifiable execution rather than unconstrained natural\-language rationales\.

### A\.3Three\-way split of point\-reference ambiguity

To analyze how much information is explicitly available from the problem text, we divide examples according to the relation between text\-mentioned point names and diagram point instances\. We first extract the point setPtextP\_\{\\mathrm\{text\}\}that is explicitly mentioned in the problem statement\. The extractor is constrained by the diagram point vocabularyPdiagP\_\{\\mathrm\{diag\}\}, so it can only return valid diagram point names and cannot infer unmentioned points from geometry or common sense\. The split is then defined as follows:

- •Text\-complete:Ptext=PdiagP\_\{\\mathrm\{text\}\}=P\_\{\\mathrm\{diag\}\}\. The text explicitly names all diagram points needed by the visual instance\.
- •Diagram\-augmented:Ptext⊂PdiagP\_\{\\mathrm\{text\}\}\\subset P\_\{\\mathrm\{diag\}\}\. The text names only a strict subset of diagram points, so the diagram supplies additional entities required for formalization\.
- •Cross\-modal mismatch:all remaining cases, including examples where the text refers to an underspecified figure, region, or target whose executable meaning must be resolved from the diagram and the formal constraints together\.

Figure[3](https://arxiv.org/html/2606.27926#A1.F3)shows one representative example from each split\. These examples are used only for qualitative illustration; the quantitative ablation in Table[2](https://arxiv.org/html/2606.27926#S4.T2)is computed over the full test split\.

![Refer to caption](https://arxiv.org/html/2606.27926v1/x3.png)Figure 3:Representative examples for the three\-way split of point\-reference ambiguity\. The classification compares point names explicitly recoverable from the problem text with point instances present in the diagram, then evaluates whether the resulting formal language can resolve the target query\.
### A\.4Additional theorem\-proposing agent details

The theorem\-proposing agent is invoked only after the formalization is executable but the solver cannot close the proof within the timeout or theorem\-search budget\. The agent observes four signals: the normalized predicate set, the current solver agenda, the unresolved target, and the latest failure diagnostic\. We serialize this information as a compact proof\-state summary rather than a full search tree, which keeps the prompt short and prevents the model from relying on irrelevant derivations\.

Each proposal follows a restricted schema:

τ=\(type,premises,conclusion,entities,rationale\)\.\\tau=\(\\mathrm\{type\},\\mathrm\{premises\},\\mathrm\{conclusion\},\\mathrm\{entities\},\\mathrm\{rationale\}\)\.\(10\)The type field specifies whether the proposal is, for example, a similarity relation, cyclicity relation, equal\-angle relation, proportionality relation, or auxiliary construction\. The entity field must use only point, line, circle, or expression identifiers already declared inΛ^\\hat\{\\Lambda\}, unless the proposal type explicitly permits a bounded auxiliary construction\. This schema makes the proposal easy to parse and prevents free\-form natural\-language suggestions from entering the solver\.

A proposal is accepted only if it passes four checks: well\-typedness under the formal language, applicability to the current proof state, symbolic verification of the proposed conclusion from admissible premises, and goal reduction under the solver’s search procedure\. Unverified proposals are discarded and do not affect the final answer\. Verified proposals are treated as temporary lemmas scoped to the current problem instance rather than permanent additions to the global theorem library, and the verifier stores the corresponding local certificate or construction trace\. This design allows the system to bridge instance\-specific deductive gaps while preserving the solver’s role as the authority for correctness\.

### A\.5OCR failure modes in cascaded pipelines

The gap between fully automatic OCR\-based pipelines and their non\-OCR counterparts is not only a matter of small transcription noise\. In geometry diagrams, labels are often tiny, slanted, partially occluded by arrows or tick marks, and visually entangled with nearby symbolic marks\. Under these conditions, OCR errors frequently alter the symbolic identity of an entity or the algebraic form of a constraint, which then propagates directly into the formal language used by the downstream solver\.

Figure[4](https://arxiv.org/html/2606.27926#A1.F4)shows representative OCR failures on geometry diagram crops\. The first two cases illustrate expression\-level corruption: in panel \(a\), the length expressionx \+ 1is read asx1, effectively deleting the operator; in panel \(b\),3x \- 4is misread as3x’ \- 4, injecting an extra stroke into the symbolic item\. The remaining four cases illustrate label\-level confusion\. Lowercase letters can be mapped to digits or visually similar glyphs, e\.g\.,b→\\rightarrow6in panel \(c\),q→\\rightarrow9in panel \(d\), the point labelI→\\rightarrow/in panel \(e\), and the line labell→\\rightarrowein panel \(f\)\.

These failures are especially damaging in cascaded geometry systems because OCR output is typically treated as an authoritative symbol table for subsequent parsing and reasoning\. Once a segment length is written asx1instead ofx \+ 1, or a line label is changed fromltoe, downstream modules no longer operate on a slightly noisy observation; they operate on a different formal problem\. The symbolic solver can verify consistency only with respect to the corrupted predicates it receives, so it has very limited ability to recover from these front\-end mistakes\. This explains why OCR\-based variants in the main results degrade much more sharply than end\-to\-end approaches that avoid explicit OCR as an intermediate bottleneck\.

![Refer to caption](https://arxiv.org/html/2606.27926v1/x4.png)Figure 4:Representative OCR failures and their propagated symbolic consequences in cascaded geometry pipelines\. Panels \(a\)–\(b\) show expression\-level corruption, including operator deletion and spurious stroke insertion\. Panels \(c\)–\(f\) show label\-level confusion, where small line or point labels are misread as digits or visually similar glyphs\. In each case, the OCR output is not merely cosmetically wrong; it induces an incorrect symbolic item that changes the formal problem passed to the downstream solver\.
Verifiable Geometry Problem Solving: Solver-Driven Autoformalization and Theorem Proposing

Similar Articles

BiNSGPS: Geometry Problem Solving via Bidirectional Neuro-Symbolic Interaction

VeriGeo: Controllable Geometry Question Generation with Numerical and Analytical Verification

MathAtlas: A Benchmark for Autoformalization in the Wild

Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics

Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation

Submit Feedback

Similar Articles

BiNSGPS: Geometry Problem Solving via Bidirectional Neuro-Symbolic Interaction
VeriGeo: Controllable Geometry Question Generation with Numerical and Analytical Verification
MathAtlas: A Benchmark for Autoformalization in the Wild
Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics
Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation