VeriGeo: Controllable Geometry Question Generation with Numerical and Analytical Verification

arXiv cs.AI 06/15/26, 04:00 AM Papers
geometry question-generation verification llm synthetic-data multimodal-reasoning
Summary
VeriGeo introduces a controllable geometry question generation framework that uses verification-guided reflection to ensure numerical and analytical consistency. The method produces high-quality synthetic data, achieving state-of-the-art results on GeoQA and strong performance on PGPS9K and MathVista-GPS.
arXiv:2606.14176v1 Announce Type: new Abstract: Geometry problem generation is useful for AI-assisted education and multimodal mathematical reasoning, but reliable synthesis remains difficult because the problem statement, diagram, constraints, and solution should be mutually consistent. Existing methods often trade off controllability and reliability: seed-based rewriting is flexible but weakly verifiable, whereas diagram-first construction improves validity but is less suited to arbitrary user-specified constraints. We introduce VeriGeo, a controllable geometry generation framework grounded in executable reasoning traces. Given user constraints such as target concepts and difficulty, an Author agent generates a problem and diagram, and a Solver agent produces a proof-aligned solution. Both agents use a shared action sequence that connects natural language, diagrams, geometric constraints, and proof steps into a verifiable representation. A three-stage pipeline checks numerical consistency, analytical realizability, and global consistency, using verification-guided reflection to repair recoverable failures and reject unrecoverable ones. Across five LLM backbones, raw generations frequently fail these checks, while VeriGeo repairs a substantial fraction of the invalid attempts. Supervised fine-tuning on 8.7k examples generated by VeriGeo achieves the best reported GeoQA performance among end-to-end multimodal LLM-based solvers, and obtains strong results on PGPS9K and MathVista-GPS, demonstrating the effectiveness of verified synthetic data for improving multimodal geometry reasoning.
# VeriGeo: Controllable Geometry Question Generation with Numerical and Analytical Verification
Source: [https://arxiv.org/html/2606.14176](https://arxiv.org/html/2606.14176)
Xiaoxian Duan$^{1,2}$, Zequn Liu$^{2}$, Yingce Xia$^{2,*}$

$^{1}$Institute of Automation, Chinese Academy of Sciences, Beijing, China
$^{2}$Zhongguancun Academy, Beijing, China

duanxiaoxian2026@ia.ac.cn, liuzequn@bza.edu.cn, xiayingce@bza.edu.cn

###### Abstract

Geometry problem generation is useful for AI\-assisted education and multimodal mathematical reasoning, but reliable synthesis remains difficult because the problem statement, diagram, constraints, and solution should be mutually consistent\. Existing methods often trade off controllability and reliability: seed\-based rewriting is flexible but weakly verifiable, whereas diagram\-first construction improves validity but is less suited to arbitrary user\-specified constraints\. We introduce VeriGeo, a controllable geometry generation framework grounded in executable reasoning traces\. Given user constraints such as target concepts and difficulty, an Author agent generates a problem and diagram, and a Solver agent produces a proof\-aligned solution\. Both agents use a shared action sequence that connects natural language, diagrams, geometric constraints, and proof steps into a verifiable representation\. A three\-stage pipeline checks numerical consistency, analytical realizability, and global consistency, using verification\-guided reflection to repair recoverable failures and reject unrecoverable ones\. Across five LLM backbones, raw generations frequently fail these checks, while VeriGeo repairs a substantial fraction of the invalid attempts\. Supervised fine\-tuning on 8\.7k examples generated by VeriGeo achieves the best reported GeoQA performance among end\-to\-end multimodal LLM\-based solvers, and obtains strong results on PGPS9K and MathVista\-GPS, demonstrating the effectiveness of verified synthetic data for improving multimodal geometry reasoning\.

## 1 Introduction

Geometry serves as an important benchmark for reasoning capabilities, presenting unique challenges in both the education of students and the training of Large Language Models (LLMs) (Kazemi et al., [2023](https://arxiv.org/html/2606.14176#bib.bib47); Chen et al., [2021](https://arxiv.org/html/2606.14176#bib.bib2); Trinh et al., [2024](https://arxiv.org/html/2606.14176#bib.bib45); Zhang et al., [2024](https://arxiv.org/html/2606.14176#bib.bib66)). Unlike general textual tasks, where the context is self-contained within the words, geometric problem-solving is inherently multimodal: it demands that a solver cross-reference textual conditions with visual constraints in a corresponding diagram (Kazemi et al., [2023](https://arxiv.org/html/2606.14176#bib.bib47); Chen et al., [2021](https://arxiv.org/html/2606.14176#bib.bib2); Seo et al., [2015](https://arxiv.org/html/2606.14176#bib.bib42); Lu et al., [2021](https://arxiv.org/html/2606.14176#bib.bib43)). Consequently, high-quality geometry data is difficult to synthesize because the problem statement, diagram, symbolic constraints, and solution are expected to be mutually consistent (Chen et al., [2021](https://arxiv.org/html/2606.14176#bib.bib2); Seo et al., [2015](https://arxiv.org/html/2606.14176#bib.bib42); Lu et al., [2021](https://arxiv.org/html/2606.14176#bib.bib43); Fu et al., [2025](https://arxiv.org/html/2606.14176#bib.bib52)). Large-scale supervision is either prohibitively expensive to curate manually or, when synthesized, suffers from compromises that degrade its reliability (Fu et al., [2025](https://arxiv.org/html/2606.14176#bib.bib52); Chen et al., [2021](https://arxiv.org/html/2606.14176#bib.bib2); Lu et al., [2021](https://arxiv.org/html/2606.14176#bib.bib43); Trinh et al., [2024](https://arxiv.org/html/2606.14176#bib.bib45)).

An ideal geometric question generator should exhibit controllability, verifiability, and diversity (Fu et al., [2025](https://arxiv.org/html/2606.14176#bib.bib52); Zhang et al., [2025a](https://arxiv.org/html/2606.14176#bib.bib7)), but existing approaches typically emphasize only part of this goal. Diagram-first approaches first construct a geometric configuration, often using symbolic languages, formal graphs, or theorem-grounded construction rules, and then formulate questions based on the generated structure (de Moura et al., [2015](https://arxiv.org/html/2606.14176#bib.bib61); Fu et al., [2025](https://arxiv.org/html/2606.14176#bib.bib52); Deng et al., [2025](https://arxiv.org/html/2606.14176#bib.bib51)). This improves validity, but anchoring generation to a pre-constructed diagram makes it difficult to flexibly satisfy arbitrary user-specified constraints, such as target concepts, difficulty levels, or diagram requirements (Singhal et al., [2014](https://arxiv.org/html/2606.14176#bib.bib46); Fu et al., [2025](https://arxiv.org/html/2606.14176#bib.bib52)). Seed-based generation methods instead use LLMs to rewrite or modify existing questions (Yu et al., [2024](https://arxiv.org/html/2606.14176#bib.bib48); Zhou et al., [2023](https://arxiv.org/html/2606.14176#bib.bib64); Cai et al., [2024](https://arxiv.org/html/2606.14176#bib.bib49)). They are flexible and convenient for producing variants, but their rewrite-based process can introduce hallucinated constraints and cross-modal inconsistencies among the problem statement, diagram, and solution, which are difficult to detect and repair (Zhou et al., [2024](https://arxiv.org/html/2606.14176#bib.bib56); Cai et al., [2024](https://arxiv.org/html/2606.14176#bib.bib49)). Moreover, their diversity remains bounded by the seed distribution (Gao et al., [2025](https://arxiv.org/html/2606.14176#bib.bib3); Yu et al., [2024](https://arxiv.org/html/2606.14176#bib.bib48)).

To address the aforementioned limitations, we propose VeriGeo, a generalizable agent\-based framework for geometric question generation with enhanced verification capabilities\. VeriGeo comprises an Author agent, which generates questions \(consisting of textual descriptions and corresponding diagrams\) based on user\-defined constraints, and a Solver agent designed to solve them\. Both agents operate through shared executable action sequences that connect natural language, diagrams, geometric constraints, and proof steps into verifiable representations\. Based on this executable representation, VeriGeo performs three complementary verification stages\. First, numerical verification executes the action sequence and checks local geometric consistency during construction, such as whether a claimed collinearity or perpendicularity relation holds within tolerance\. Second, analytical verification compiles geometric constraints into algebraic systems to test whether the configuration is geometrically realizable\. Third, LLM\-assisted logical verification audits global consistency among the problem text, diagram, action sequence, and solution, including contradictory assumptions, unsupported inferences, and missing cases\.

Experiments across five LLM backbones show that VeriGeo substantially improves generation validity while supporting fine\-grained control over difficulty and geometry concepts\. Raw LLM generations are rarely reliable without verification, with an average direct\-pass rate of only 29\.02% across backbones\. Gemini\-3\.1\-Pro, Qwen3\.5\-Plus, and Claude\-Opus\-4\.6 recover 36\.00%, 30\.67%, and 20\.22% of generations through verification\-guided repair, respectively, showing that repair is a major source of verified data rather than a minor post\-processing step\. In addition to improving validity, VeriGeo also broadens the conceptual coverage of generated geometry data, covering 354 distinct geometry concepts in 100 samples and surpassing both manually curated datasets and prior diagram\-first or seed\-based generation pipelines\. Beyond intrinsic validity, we further evaluate the usefulness of the generated data by supervised fine\-tuning Qwen2\.5\-VL\-7B\-Instruct\. Training on only 8\.7k verified VeriGeo examples yields strong performance on standard multimodal geometry benchmarks\. Specifically, the resulting model reaches 59\.40%, 82\.74%, and 75\.96% accuracy on PGPS9K, GeoQA, and MathVista\-GPS\. To the best of our knowledge, VeriGeo achieves the best reported GeoQA performance among MLLM\-based geometry solvers\. Among prior geometry data generation methods that train MLLMs via supervised fine\-tuning, VeriGeo also obtains the best reported results on PGPS9K and MathVista\-GPS\.

Our contributions can be summarized as follows: (1) *Controllable geometry generation.* We introduce VeriGeo, a closed-loop framework that synthesizes multimodal geometry problems and proof-aligned solutions from user-specified constraints, including difficulty, target concepts, and diagram requirements. (2) *Verification-guided reliability.* VeriGeo grounds generation in executable action sequences and verifies each instance through numerical, analytical, and logical checks, enabling automatic repair of text–diagram–solution inconsistencies. (3) *Verified data with empirical utility.* Experiments across five LLM backbones show improved generation validity and fine-grained controllability. Fine-tuning on 8.7k verified examples further yields competitive performance on standard multimodal geometry benchmarks.

![Refer to caption](https://arxiv.org/html/2606.14176v1/x1.png)

Figure 1: The workflow of VeriGeo comprises two primary components: an Author Agent for question generation and a Solver Agent for answer generation. Each agent employs a three-step verification process consisting of numerical, analytical, and logical checks. A reflection mechanism (denoted as “R” in the figure) is triggered specifically when verification fails. All the prompts used in this paper are provided in the supplementary materials.

## 2 Methodology

### 2.1 Overview

We formulate geometry question generation as the synthesis of a question $Q$ and a solution $S$, subject to user-defined constraints $C$. The question $Q=(T,D)$ consists of a natural language problem statement $T$ and a corresponding diagram $D$. The constraints $C$ specify controllable attributes, such as difficulty level and required geometry concepts.

Upon receiving $C$, the framework first synthesizes a blueprint $B$ by sampling from a predefined library of geometry concepts and difficulty rules. An Author Agent, conditioned on $B$, generates the textual statement $T$ and constructs the diagram $D$ through a series of actions (denoted as $A$). To improve the correctness of the generated questions, we apply a three-step verification process: numerical, analytical, and logical verification. Any failure at these stages triggers a reflection mechanism, prompting the Author Agent to repair the content. Subsequently, a Solver Agent generates the solution $S$ based on $Q$, utilizing a process similar to that of the Author Agent.

### 2.2 Blueprint Generation

The blueprint $B$ is generated by an LLM from the user constraints and specifies the geometry concepts, difficulty level, diagram complexity, and related attributes. We maintain a manually curated library of geometry concepts spanning Euclidean geometry ($245$ geometry concepts), vector geometry ($71$ geometry concepts), and function-augmented geometry ($110$ geometry concepts). We consider three difficulty levels: easy, medium, and hard. For each difficulty level, we provide three representative examples in Appendix [G](https://arxiv.org/html/2606.14176#A7), and users can replace them with domain- or curriculum-specific examples when adapting VeriGeo to new educational settings.

### 2.3 Author Agent for Question Generation

Guided by the blueprint $B$, the Author Agent first synthesizes the natural language problem statement $T$. Subsequently, it generates a sequence of executable actions to construct the corresponding diagram $D$. These action sequences undergo a three-step verification process; if any errors are detected, the agent triggers a self-reflection mechanism to refine the output.

⊳ Step 1: Action generation and numerical verification. (Full action list in Appendix [D](https://arxiv.org/html/2606.14176#A4).)

As illustrated in Figure [1](https://arxiv.org/html/2606.14176#S1.F1), the Author agent translates the text $T$ into an executable action sequence $A=\{a_t\}_t$. Each action is defined as a tuple $a_t := (\texttt{op}, \texttt{type}, \texttt{args})$, where: (1) `op` denotes the primary operation category, of which there are 11 types. For example, `AddPoint`/`AddCircle` inserts a point/circle into the diagram; `MovePoint` moves a point to a specific position in the reflection stage. (2) `type` specifies the concrete method for the operation. For instance, an `AddPoint` operation may use the type `Free` (for an unconstrained point) or `Cartesian` (for a point with explicit coordinates). (3) `args` contains the parameters required to execute the specific operation. For example, a Cartesian point might require arguments such as `["P", "0", "0"]`.
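To make the tuple structure concrete, the following is a minimal sketch of how an action $a_t$ could be held in memory. The class name `Action` and the example sequence are illustrative assumptions; only the field names `op`, `type`, `args`, the operation `AddPoint`, the types `Free` and `Cartesian`, and the argument layout `["P", "0", "0"]` come from the text. The argument layout for a `Free` point is a guess.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Action:
    """One step a_t = (op, type, args) of an action sequence (illustrative only)."""
    op: str                 # operation category, e.g. "AddPoint"
    type: str               # concrete method, e.g. "Free" or "Cartesian"
    args: Tuple[str, ...]   # parameters, kept as strings as in the text's example


# A toy sequence. The point names and coordinates are made up, and the
# single-element args for a "Free" point is an assumption.
sequence = [
    Action("AddPoint", "Cartesian", ("O", "0", "0")),
    Action("AddPoint", "Cartesian", ("P", "3", "0")),
    Action("AddPoint", "Free", ("T",)),
]

ops = [a.op for a in sequence]
types = [a.type for a in sequence]
```

Keeping `args` as strings leaves room for exact forms such as rationals, which would be parsed only at execution time.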

Upon generating the full sequence, the system executes the script to render the diagram. This execution process inherently performs numerical verification. As each operation is processed, the system checks whether the geometric constraints hold numerically against the current coordinate state. For example, as illustrated in Figure [1](https://arxiv.org/html/2606.14176#S1.F1), if the script asserts a constraint such as `Collinear(O, P, T)` but the coordinates of $T$ derived from previous steps do not lie on line $OP$ (within tolerance), the execution triggers a numerical violation error. To further improve numerical robustness, we represent generated quantities in human-readable exact forms, such as rationals (e.g., `"-2/3"`) and radicals (e.g., `"\sqrt(2)"`), with details in Appendix [D.2](https://arxiv.org/html/2606.14176#A4.SS2).
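The numerical check described above can be sketched as follows. This is not the paper's executor: the tolerance value, the exception name `NumericalViolation`, and the helper names are assumptions, and only rational strings are parsed here (radicals are omitted).

```python
from fractions import Fraction

TOL = 1e-9  # assumed tolerance; the text does not state the value used


class NumericalViolation(Exception):
    """Raised when an asserted constraint fails on the current coordinates."""


def to_float(exact: str) -> float:
    """Parse a rational string such as "-2/3" and evaluate it numerically."""
    return float(Fraction(exact))


def assert_collinear(a, b, c, tol=TOL):
    """Check Collinear(a, b, c) via the 2x2 determinant of the difference vectors."""
    (xa, ya), (xb, yb), (xc, yc) = a, b, c
    det = (xb - xa) * (yc - ya) - (yb - ya) * (xc - xa)
    if abs(det) > tol:
        raise NumericalViolation(f"Collinear failed: |det| = {abs(det):.3g} > {tol}")


# Made-up coordinates: T lies on line OP, T_off does not.
O = (to_float("0"), to_float("0"))
P = (to_float("3"), to_float("-2/3"))
T = (to_float("6"), to_float("-4/3"))
T_off = (to_float("6"), to_float("-1"))

assert_collinear(O, P, T)  # holds within tolerance, so no exception

try:
    assert_collinear(O, P, T_off)
    violated = False
except NumericalViolation:
    violated = True
```

In a full pipeline, the error message would presumably be passed to the reflection step as the failure signal.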

⊳ Step 2: Analytical verification. (Full verification list in Appendix [E.1](https://arxiv.org/html/2606.14176#A5.SS1).)

Numerical verification is efficient but insufficient, as floating\-point inaccuracies may allow incorrect configurations to pass within tolerance\. To improve correctness, we apply analytical verification to ensure the construction is mathematically realizable\.

(1) Analytic Representation. We first define a set of scalar variables $\mathcal{V}$ representing the diagram’s degrees of freedom. Each non-fixed point $P$ is assigned variable coordinates $(x_P, y_P)$, while geometric primitives are assigned their necessary intrinsic parameters (e.g., a circle is defined by its center coordinates and radius $r$).

(2) Constraint Mapping. The system is constructed by iterating over the graph constraints induced by the actions in Step 1. Each operation `op` may impose specific geometric constraints, which are compiled into a set of algebraic equations $\mathcal{E}$. For instance, an operation `Collinear(A, B, C)` implies that the determinant of their coordinate matrix must vanish, yielding the equation:

$$(x_B - x_A)(y_C - y_A) - (y_B - y_A)(x_C - x_A) = 0.$$

(3) Global Solving and Conflict Resolution. We use the `sympy` Python package (Meurer et al., [2017](https://arxiv.org/html/2606.14176#bib.bib65)) to derive the analytical solution for $\mathcal{E}$. It is important to note that the resulting system $\mathcal{E}$ is frequently ill-posed: either over-determined due to redundant construction steps or under-determined due to rigid body motion invariance. To address this, we enhance the stability of the equations by implementing a series of engineering techniques, such as gauge fixing and rank-aware filtering (see Appendix [E.1](https://arxiv.org/html/2606.14176#A5.SS1)–[E.3](https://arxiv.org/html/2606.14176#A5.SS3)). A successful convergence produces a valid coordinate realization $\mathcal{V}^{*}$, certifying that the diagram is geometrically realizable. If the solver fails to converge within the specified tolerance, the instance is flagged as invalid and sent for repair.
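As a rough illustration of items (1) to (3), the sketch below compiles a few constraints into polynomial equations and hands them to `sympy.solve`. It is a toy under stated assumptions, not the paper's solver: the helper names, the choice of pinning two points as a stand-in for gauge fixing, and the example configuration are all made up, and no rank-aware filtering is attempted.

```python
import sympy as sp

x, y = sp.symbols("x y", real=True)  # degrees of freedom of the free point C

# Stand-in for gauge fixing (assumed form): pin A and B so that rigid
# motions of the whole figure no longer leave the system under-determined.
A = (sp.Integer(0), sp.Integer(0))
B = (sp.Integer(4), sp.Integer(0))
C = (x, y)


def collinear(p, q, r):
    """Determinant form of Collinear(p, q, r); vanishes iff the points are collinear."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def dist_sq(p, q):
    """Squared Euclidean distance, kept polynomial on purpose."""
    return (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2


def perpendicular(p, q, r, s):
    """Dot product of pq and rs; vanishes iff the segments are perpendicular."""
    return (q[0] - p[0]) * (s[0] - r[0]) + (q[1] - p[1]) * (s[1] - r[1])


# Realizable: C on line AB with |AC| = 1.
consistent = [collinear(A, B, C), dist_sq(A, C) - 1]
realizations = sp.solve(consistent, [x, y], dict=True)

# Contradictory: additionally demand AB perpendicular to AC, which forces C = A.
contradictory = consistent + [perpendicular(A, B, A, C)]
no_realization = sp.solve(contradictory, [x, y], dict=True)

is_realizable = bool(realizations)
is_contradictory = not no_realization
```

Here the consistent system has two realizations for $C$, while the contradictory one has none, which is the signal that would send an instance to repair.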

⊳ Step 3: Logical verification.

Logical verification employs an LLM\-as\-a\-judge paradigm to determine whether the reasoning contains logical flaws, mainly from two perspectives: \(i\) auditing whether the action sequence and the problem specification are logically self\-consistent; and \(ii\) inspecting the question and solution logic for latent defects, including contradictory assumptions, invalid inferences, etc\.

⊳ Self-reflection

If a failure occurs during verification, the system triggers an immediate self\-reflection mechanism\. By incorporating the error signals \(e\.g\., the failing step and the reason\), the module prompts the LLM to autonomously diagnose the issue and decide whether to revise the geometric question or the corresponding action sequence\.

### 2.4 Solver Agent for Solution Generation

The Solver Agent synthesizes the solution $S$ for the generated question $Q=(T,D)$. This agent follows a procedure similar to that of the Author Agent: first, it generates a textual solution $S$. Next, based on $S$, it generates action sequences that must pass numerical, analytical, and logical verification. We consider $S$ a valid answer only if all checks pass; otherwise, a reflection step is triggered. Three specific features are worth noting:

(1) Auxiliary Line Construction: We allow the Solver Agent to introduce auxiliary lines through dedicated operations, including but not limited to `AddAuxLine`.

\(2\) Stepwise Derivation Check: For each logical step in the proof \(e\.g\., claiming a specific angle equality or length relationship\), the agent asserts a corresponding verifiable predicate\. It immediately validates these claims by evaluating them against the computed coordinates\.

\(3\) During problem solving, the Solver may introduce additional constraints into the graph to make implicit relations explicit and to support subsequent derivations\. After the full solution is constructed, we obtain an augmented graph that encodes both the original problem constraints and those added by the Solver\. We then perform an analytical verification pass by traversing this augmented graph and checking global solvability/consistency\.
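The stepwise derivation check in item (2) can be sketched as evaluating each asserted predicate against the solved coordinates and stopping at the first failure. The predicate names `Perpendicular` and `LengthEquals`, the coordinates, and the tolerance are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical solved coordinates for a 3-4-5 right triangle (not from the paper).
coords = {"A": (0.0, 0.0), "B": (3.0, 0.0), "C": (0.0, 4.0)}
TOL = 1e-9  # assumed tolerance


def perpendicular(p, q, r, s):
    """Segment pq is perpendicular to segment rs if their dot product vanishes."""
    (px, py), (qx, qy) = coords[p], coords[q]
    (rx, ry), (sx, sy) = coords[r], coords[s]
    return abs((qx - px) * (sx - rx) + (qy - py) * (sy - ry)) <= TOL


def length_equals(p, q, value):
    """The distance between p and q matches the claimed value within tolerance."""
    return abs(math.dist(coords[p], coords[q]) - value) <= TOL


PREDICATES = {"Perpendicular": perpendicular, "LengthEquals": length_equals}


def first_failing_step(steps):
    """Return the index of the first proof step whose predicate fails, else None."""
    for i, (name, args) in enumerate(steps):
        if not PREDICATES[name](*args):
            return i
    return None


proof = [
    ("Perpendicular", ("A", "B", "A", "C")),
    ("LengthEquals", ("B", "C", 5.0)),
    ("LengthEquals", ("A", "C", 3.0)),  # deliberately wrong: AC is 4
]

failing = first_failing_step(proof)
```

Returning the index of the failing step mirrors the error signal (failing step and reason) that the reflection mechanism is described as consuming.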

## 3 Experiments

### 3.1 Experimental Settings

For the Author and Solver agents, we evaluated five LLM backbones: `gemini-3.1-pro-preview`, `claude-opus-4-6`, `qwen3.5-plus`, `gpt-5.4`, and `gpt-5.4-mini`. For all backbones, we set `temperature=1` and `max_output_tokens=60000`. We generated problems across three geometric categories: Euclid geometry, Vector coordinates, and Function calculus. Each category is further stratified into three difficulty levels: easy, medium, and hard. We allowed a maximum of $1$ reflection round whenever a verification failure is triggered, for both the Author Agent and the Solver Agent. We conducted experiments to evaluate the generated geometric problems across three fundamental dimensions: verifiability, controllability, and diversity.

### 3.2 Evaluation of Verifiability

We first assess the efficacy of our verification mechanism in validating and repairing generated data. To this end, we conduct a fixed-budget generation experiment with $450$ independent attempts for each LLM backbone, covering three geometric categories and three difficulty levels with $50$ attempts in each category–difficulty cell. Unlike a quota-based setting, we do not continue generation until a predefined number of valid questions is obtained; instead, each attempt is counted once and assigned to its final verification outcome.

For each attempt, we categorize the final outcome into three mutually exclusive groups: (1) Direct Pass, where the initial generation passes all verification checks; (2) Repaired, where the attempt fails at least one verification stage but is successfully corrected through verification-guided reflection; and (3) Rejected, where the attempt still fails after the maximum allowable reflection rounds. Let $N_{\mathrm{total}}$ denote the number of generation attempts, which is $450$ for each backbone and $50$ for each category–difficulty cell. For each outcome category $c \in \{\mathrm{Direct\ Pass}, \mathrm{Repaired}, \mathrm{Rejected}\}$, we compute $R_c = \frac{N_c}{N_{\mathrm{total}}}$, where $N_c$ is the number of attempts assigned to category $c$.
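The rate $R_c$ is a plain frequency, as the short sketch below shows. The per-category counts are not reported in the text; the values 244, 162, and 44 are back-derived here from the `gemini-3.1-pro-preview` percentages in Table 1 and $N_{\mathrm{total}} = 450$, and should be read as an assumption.

```python
from collections import Counter

CATEGORIES = ("Direct Pass", "Repaired", "Rejected")


def outcome_rates(outcomes):
    """Compute R_c = N_c / N_total for each outcome category c."""
    counts = Counter(outcomes)
    n_total = len(outcomes)
    return {c: counts[c] / n_total for c in CATEGORIES}


# Counts back-derived from reported percentages (assumption, see lead-in).
outcomes = ["Direct Pass"] * 244 + ["Repaired"] * 162 + ["Rejected"] * 44
rates = outcome_rates(outcomes)

for category, rate in rates.items():
    print(f"{category:12s} {100 * rate:.2f}%")
```

Because the three groups are mutually exclusive and every attempt is counted once, the three rates sum to 1.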

Table [1](https://arxiv.org/html/2606.14176#S3.T1) reports the distribution of verification outcomes across five LLM backbones. The results show that raw LLM generations remain insufficiently reliable without verification: the average Direct Pass rate is 29.02%, with the weakest model, `gpt-5.4-mini`, achieving only 2.44% direct pass. In contrast, verification-guided reflection recovers a substantial fraction of otherwise invalid attempts, with an average Repaired rate of 25.78% across backbones. The strongest model, `gemini-3.1-pro-preview`, achieves 54.22% direct pass and further repairs 36.00% of attempts, leaving only 9.78% rejected. Similarly, `qwen3.5-plus` and `claude-opus-4-6` repair 30.67% and 20.22% of attempts, respectively, showing that repair is a major source of verified data rather than a minor post-processing step. At the same time, the high rejection rates of weaker backbones, especially `gpt-5.4-mini` with 88.00% rejected and `gpt-5.4` with 58.00% rejected, indicate that verification remains essential for filtering unrecoverable failures.

Table [1](https://arxiv.org/html/2606.14176#S3.T1) also shows that verification yield varies substantially across backbones under different cost profiles. `gemini-3.1-pro-preview` achieves the highest verified yield, with only 9.78% of attempts rejected, but its amortized cost per accepted question remains higher than that of `qwen3.5-plus` and `gpt-5.4`. By contrast, `qwen3.5-plus` provides the lowest estimated cost per verified instance in our setting, while `gpt-5.4-mini`, despite its low total cost, remains inefficient after amortization because of its high rejection rate. These results highlight the importance of accounting for verification outcomes when comparing the practical efficiency of different generation backbones.
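The amortization argument can be reproduced from the Table 1 figures. The sketch below assumes that "Avg. Cost" equals Total Cost divided by the number of accepted attempts, and that the accepted count can be recovered by rounding $450 \times (1 - \text{Rejected})$. Both are assumptions that happen to match the reported values to four decimals.

```python
N_TOTAL = 450  # attempts per backbone, as stated in the text

# model: (rejected rate in %, total API cost in USD, reported avg. cost in USD),
# all copied from Table 1.
table1 = {
    "Gemini 3.1 Pro": (9.78, 116.75, 0.2876),
    "Qwen3.5-Plus": (31.11, 17.66, 0.0570),
    "Claude Opus 4.6": (39.11, 149.89, 0.5470),
    "GPT-5.4": (58.00, 30.56, 0.1617),
    "GPT-5.4 mini": (88.00, 9.13, 0.1691),
}


def amortized_cost(rejected_pct, total_cost, n_total=N_TOTAL):
    """Return (accepted attempts, cost per accepted attempt) under the stated assumptions."""
    accepted = round(n_total * (1 - rejected_pct / 100))
    return accepted, total_cost / accepted


results = {}
for model, (rejected, total, reported) in table1.items():
    accepted, cost = amortized_cost(rejected, total)
    results[model] = (accepted, cost)
    print(f"{model:16s} accepted={accepted:3d} "
          f"cost=${cost:.4f} (reported ${reported:.4f})")
```

This makes the point in the text explicit: GPT-5.4 mini has the lowest total spend, yet only about one attempt in eight is accepted, so its cost per accepted question ends up above that of GPT-5.4 and about three times that of Qwen3.5-Plus.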

| Model | Direct Pass | Repaired | Rejected | Avg. Tokens | Avg. Time (s) | Avg. Cost | Total Cost |
|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | 54.22% | 36.00% | 9.78% | 62,733 | 449.8 | \$0.2876 | \$116.75 |
| Qwen3.5-Plus | 38.22% | 30.67% | 31.11% | 165,812 | 860.0 | \$0.0570 | \$17.66 |
| Claude Opus 4.6 | 40.67% | 20.22% | 39.11% | 23,087 | 121.3 | \$0.5470 | \$149.89 |
| GPT-5.4 | 9.56% | 32.44% | 58.00% | 33,624 | 79.2 | \$0.1617 | \$30.56 |
| GPT-5.4 mini | 2.44% | 9.56% | 88.00% | 34,307 | 30.6 | \$0.1691 | \$9.13 |

Table 1: Verification outcomes and generation costs across five LLM backbones. Direct Pass, Repaired, and Rejected denote attempts that pass directly, pass after repair, or fail verification. Average statistics are reported per accepted question, and Total Cost reports the total API cost. See Table [A1](https://arxiv.org/html/2606.14176#A2.T1) in Appendix [B](https://arxiv.org/html/2606.14176#A2) for results by geometry category and difficulty level.

To further analyze the contribution of each verification component, we report the failure detection rate for the numerical, analytical, and logical verification modules. Specifically, for a given module $m$, we define $N_m$ as the number of verification failure events primarily detected by module $m$, including failures that are later repaired. Let $N_{\mathrm{acc}}$ denote the number of final accepted attempts. The failure detection rate is defined as $R_m = N_m / (N_m + N_{\mathrm{acc}})$. Results are reported in Table [2](https://arxiv.org/html/2606.14176#S3.T2).

We observe that the modules capture complementary failure modes. Numerical verification detects a large fraction of invalid attempts for most backbones, especially `gpt-5.4-mini` (83.28%), `gpt-5.4` (59.70%), `claude-opus-4-6` (55.30%), and `qwen3.5-plus` (54.88%), suggesting that many generation errors first appear as locally inconsistent geometric constructions or numerical constraints. Analytical verification further identifies failures that are not caught by local execution, with particularly high detection rates for `gpt-5.4-mini` (76.32%) and `gpt-5.4` (40.38%), highlighting the importance of checking global geometric realizability. Logical verification also contributes non-trivially, especially for `gemini-3.1-pro-preview` (29.51%), indicating that some errors arise from higher-level inconsistencies among the problem statement, diagram, action sequence, and solution rather than from numerical or algebraic infeasibility alone.

| Model | Numerical | Analytical | Logical |
|---|---|---|---|
| claude-opus-4-6 | 55.30% | 23.25% | 8.36% |
| gemini-3.1-pro-preview | 13.43% | 20.08% | 29.51% |
| gpt-5.4 | 59.70% | 40.38% | 18.18% |
| gpt-5.4-mini | 83.28% | 76.32% | 10.00% |
| qwen3.5-plus | 54.88% | 4.02% | 12.92% |

Table 2: Failure detection rates of numerical, analytical, and logical verification across five LLM backbones. Rates indicate where verification failures are first detected. See Table [A2](https://arxiv.org/html/2606.14176#A2.T2) in Appendix [B](https://arxiv.org/html/2606.14176#A2) for extended results.

| Demonstration set | Easy | Medium | Hard |
|---|---|---|---|
| Set 1 | 96.08% | 93.41% | 86.75% |
| Set 2 | 98.04% | 96.70% | 84.34% |
| Set 3 | 93.14% | 95.60% | 78.31% |

Table 3: Target-difficulty matching rates (evaluated by Qwen3.5-Plus) on generated geometry problems under three different few-shot demonstration sets.

### 3.3 Evaluation of Controllability

#### Controllability of difficulty level.

To evaluate whether the generated problems follow the intended difficulty control, we use Qwen3.5-Plus as an independent few-shot difficulty judge. For each generated problem, the model is asked to classify its difficulty into one of the predefined levels: easy, medium, or hard. We construct three different few-shot demonstration sets for the judge and report the target-difficulty matching rate, i.e., the proportion of generated problems whose judged difficulty matches the intended target difficulty. Set 1 uses the same examples as those used for question generation, while Sets 2 and 3 use two independently constructed demonstration sets to test whether the evaluation is robust to the choice of few-shot examples.

As shown in Table [3](https://arxiv.org/html/2606.14176#S3.T3), the matching rates remain consistently high across different few-shot demonstration sets. This indicates that the difficulty labels of the generated problems are largely recoverable by an independent judge and are not overly sensitive to a single choice of few-shot examples. The matching rates are especially strong for easy and medium problems, while hard problems show relatively lower agreement, suggesting that harder instances have more ambiguous or complex difficulty boundaries. We further visualize the distribution of geometric concepts across the three difficulty levels in Figure [A1](https://arxiv.org/html/2606.14176#A2.F1) and observe that harder problems tend to involve more geometric concepts, which is consistent with the intended difficulty control.

#### Controllability of geometry concepts\.

We further demonstrate the fine-grained control of VeriGeo over specific geometry concepts through a case study. We curated a subset of 20 Euclidean geometry concepts and tasked the model to generate 100 problems with a strict constraint: every problem must incorporate the Pythagorean theorem, while additional geometry concepts could be freely selected from the pool. We then employed an LLM to extract the underlying geometry concepts from the generated outputs.

![Refer to caption](https://arxiv.org/html/2606.14176v1/figs/control_new.png)

Figure 2: Controllability of geometry concepts and seed-conditioned generation. (a) Geometry concept distribution under explicit concept constraints. (b) Knowledge similarity between generated problems and their corresponding seeds. (c) Distributional shift of problem complexity under difficulty-controlled, seed-conditioned generation on MathVista (Harder / Original / Equivalent).

As shown in Figure [2](https://arxiv.org/html/2606.14176#S3.F2)(a), the Pythagorean theorem is detected in all generated questions, confirming that the model reliably follows the specified concept constraint. The generated problems also incorporate additional concepts from the candidate pool. In particular, center–chord perpendicularity and tangent–radius perpendicularity appear with relatively high frequency. This pattern is consistent with common pedagogical practice, where the Pythagorean theorem is often combined with circle-related perpendicularity relations to form multi-step geometry problems.

#### Generation conditioned on seed questions\.

The ability to generate new problem variants based on existing seed questions is crucial for educational applications. To evaluate this capability, we selected 100 source problems from the MathVista Lu et al. ([2024](https://arxiv.org/html/2606.14176#bib.bib50)) dataset and employed VeriGeo to synthesize variants conditioned on these seeds.

We first evaluate the semantic similarity between the newly generated variants and seed questions. For each seed-variant pair, we use an LLM-as-a-judge protocol to assign a knowledge-similarity score in $[0,1]$ (details in Appendix [H](https://arxiv.org/html/2606.14176#A8)). As shown in Figure [2](https://arxiv.org/html/2606.14176#S3.F2)(b), most variants receive high scores, suggesting that VeriGeo generates new instances while retaining the seed’s geometric knowledge.

Furthermore, we evaluate whether VeriGeo can modulate difficulty relative to a given seed question\. For each MathVista seed, we instruct the system to generate either a harder variant or an equivalent\-difficulty variant while preserving the seed’s underlying geometric theme\. We use an independent pairwise difficulty\-relation judge and report the target\-difficulty matching rate, defined as the proportion of generated variants whose judged relation to the seed matches the requested relation \(see Appendix [H](https://arxiv.org/html/2606.14176#A8) for details\)\. VeriGeo achieves a 100% matching rate \(Wilson 95% CI \(Wilson, [1927](https://arxiv.org/html/2606.14176#bib.bib70)\): \[96\.3%, 100\.0%\]\) when asked to generate harder variants and an 80\.0% matching rate \(Wilson 95% CI: \[71\.1%, 86\.7%\]\) when asked to generate equivalent\-difficulty variants\. This suggests that increasing difficulty is easier to control than preserving it, since “equivalent difficulty” is a more ambiguous target\.
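The reported Wilson intervals can be reproduced with a few lines of Python\. The following is a minimal sketch, assuming 100 generated variants per condition \(one per MathVista seed\) and z = 1\.96 for the 95% level; the function name `wilson_interval` is illustrative and is not part of the VeriGeo codebase\.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Assumed counts: 100 variants per condition, 100 and 80 judged as matching.
harder_lo, harder_hi = wilson_interval(100, 100)
equiv_lo, equiv_hi = wilson_interval(80, 100)

print(f"harder:     [{harder_lo:.1%}, {harder_hi:.1%}]")
print(f"equivalent: [{equiv_lo:.1%}, {equiv_hi:.1%}]")
```

Unlike the normal\-approximation interval, the Wilson interval stays informative when the observed rate is exactly 100%, which is why the lower bound for the harder\-variant condition is 96\.3% rather than degenerate\.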

Complementing this judge\-based evaluation, we further extract the geometry concept distribution of the generated outputs in Figure [2](https://arxiv.org/html/2606.14176#S3.F2)\(c\)\. The difficulty\-increased group shifts toward higher concept counts than the original seeds, whereas the difficulty\-maintained group closely follows the original distribution\. Together, these results show that VeriGeo can control seed\-conditioned problem complexity both at the judged difficulty level and at the structural level of geometry concepts, while retaining the seed’s thematic context\.

### 3\.4 Evaluation of Diversity

To assess the diversity of our generated data, we analyze the geometry concept coverage against existing geometric question datasets\. We randomly sampled 100 instances from VeriGeo and several baseline datasets, including: \(1\) manually curated datasets, geometry3k \(Lu et al\., [2021](https://arxiv.org/html/2606.14176#bib.bib43)\) and MathVista \(Lu et al\., [2024](https://arxiv.org/html/2606.14176#bib.bib50)\); \(2\) diagram\-first approaches, GeomVerse \(Kazemi et al\., [2023](https://arxiv.org/html/2606.14176#bib.bib47)\) and TR\-GeoMM \(Deng et al\., [2025](https://arxiv.org/html/2606.14176#bib.bib51)\); and \(3\) seed\-based approaches, GeoGPT4V \(Cai et al\., [2024](https://arxiv.org/html/2606.14176#bib.bib49)\) and Geo170K \(Gao et al\., [2025](https://arxiv.org/html/2606.14176#bib.bib3)\)\. We used the same LLM\-based extraction method as in the controllability experiments to identify the unique geometric concepts in each sample\.

Table [4](https://arxiv.org/html/2606.14176#S3.T4) reports the total count of distinct geometry concepts covered by each dataset\. VeriGeo covers a substantially broader spectrum of geometric concepts than prior work: 354 distinct concepts, versus 201 for the next\-highest dataset \(MathVista\)\. The diversity of diagram\-first approaches is constrained by the space of reliably renderable diagrams\. Seed\-based approaches expand the coverage by leveraging LLMs, yet their diversity is still bounded by the concept distribution of the seed questions\. Notably, VeriGeo surpasses even the manually curated datasets, suggesting that our framework can systematically explore a wider concept space and thereby align better with real\-world mathematical education requirements\.

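| Category | Dataset | \# Geometry concepts |
| --- | --- | --- |
| Manually Curated | geometry3k | 169 |
| Manually Curated | MathVista | 201 |
| Diagram\-First Approaches | GeomVerse | 116 |
| Diagram\-First Approaches | TR\-GeoMM | 126 |
| Seed\-based Generation | GeoGPT4V | 177 |
| Seed\-based Generation | Geo170K | 187 |
| Ours | VeriGeo | 354 |

Table 4: Comparison of geometry concept coverage across datasets\.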
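
The coverage statistic in Table 4 is a count of distinct concepts over the sampled problems\. The sketch below shows one way such a count could be computed once concepts have been extracted; it assumes the extractor returns a list of concept names per problem, and the toy data, the lower\-casing normalization, and the function name `concept_coverage` are illustrative assumptions rather than the paper's actual pipeline\.

```python
from collections import Counter
from typing import Iterable


def concept_coverage(samples: Iterable[Iterable[str]]) -> tuple[int, Counter]:
    """Return (number of distinct concepts, per-concept problem frequency)."""
    # Normalize names and deduplicate within each problem so that a concept
    # is counted at most once per problem.
    normalized = [{c.strip().lower() for c in concepts} for concepts in samples]
    frequency = Counter(c for concepts in normalized for c in concepts)
    return len(frequency), frequency


# Hypothetical extractor output for three generated problems.
toy_samples = [
    ["Pythagorean theorem", "tangent-radius perpendicularity"],
    ["pythagorean theorem", "center-chord perpendicularity"],
    ["Pythagorean theorem", "similar triangles"],
]

num_concepts, frequency = concept_coverage(toy_samples)
print(num_concepts)
print(frequency.most_common(1))
```

The same per\-concept frequencies also support the constraint check from the controllability study: a required concept is satisfied when its frequency equals the number of generated problems\.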

### 3\.5 Evaluation of Downstream Training

To evaluate whether the data generated by VeriGeo is useful beyond question generation, we further use it for downstream supervised fine\-tuning\. We fine\-tune Qwen2\.5\-VL\-7B\-Instruct on all verified examples generated by VeriGeo \(8\.7k instances in total\)\. We focus on supervised fine\-tuning and leave reinforcement learning or process\-level optimization for future work\.

We evaluate the fine\-tuned model on three standard multimodal geometry reasoning benchmarks: PGPS9K \(Zhang et al\., [2023](https://arxiv.org/html/2606.14176#bib.bib1)\), GeoQA \(Chen et al\., [2021](https://arxiv.org/html/2606.14176#bib.bib2)\), and MathVista\-GPS \(Lu et al\., [2024](https://arxiv.org/html/2606.14176#bib.bib50)\)\. The results are reported in Table [5](https://arxiv.org/html/2606.14176#S3.T5)\. Fine\-tuning on the verified VeriGeo data substantially improves the base model across all three benchmarks\. Compared with Qwen2\.5\-VL\-7B\-Instruct \(42\.40%, 72\.07%, and 53\.37%\), the VeriGeo\-tuned model achieves 59\.40%, 82\.74%, and 75\.96% on PGPS9K, GeoQA, and MathVista\-GPS, respectively\. On GeoQA, VeriGeo achieves the best reported result among end\-to\-end MLLM\-based geometry solvers\. We note that prior neural\-symbolic systems \(Zhang et al\., [2025b](https://arxiv.org/html/2606.14176#bib.bib4); Ping et al\., [2026](https://arxiv.org/html/2606.14176#bib.bib5)\) report higher GeoQA scores, but they rely on formal symbolic systems, external formalization, or deductive reasoning engines, and are therefore not directly comparable to our setting\. On PGPS9K and MathVista\-GPS, VeriGeo also obtains the best reported results among geometry data\-generation methods that train MLLMs via supervised fine\-tuning\. Notably, these gains are obtained with only 8\.7k verified training examples, substantially fewer than prior geometry data\-generation methods such as G\-LLaVA, MAVIS, TR\-CoT, and GeoGen\-SFT, which typically use tens or hundreds of thousands of examples\. These results suggest that the verified data generated by VeriGeo provides effective supervision for multimodal geometry reasoning, and that data quality and verifiability can be as important as data scale\.

Table 5: Supervised fine\-tuning of Qwen2\.5\-VL\-7B\-Instruct on 8\.7k verified examples generated by VeriGeo\. All numbers are accuracy \(%\)\. “–” indicates unavailable results under the same benchmark protocol\. Data Scale refers to task\-specific math/geometry SFT data when available\. See Table [A3](https://arxiv.org/html/2606.14176#A2.T3) in Appendix [B](https://arxiv.org/html/2606.14176#A2) for an extended comparison with additional baselines and model tuning parameters\.

| Model | Params | PGPS9K | GeoQA | MathVista\-GPS | Data Scale |
| --- | --- | --- | --- | --- | --- |
| *Open\-source general MLLMs* | | | | | |
| InternVL2\-76B \(Wang et al\., [2025](https://arxiv.org/html/2606.14176#bib.bib11)\) | 76B | – | – | 67\.80 | – |
| Qwen2\.5\-VL\-7B\-Instruct | 7B | 42\.40 | 72\.07 | 53\.37 | – |
| *Open\-source math\-tuned MLLMs* | | | | | |
| Math\-LLaVA\-13B \(Shi et al\., [2024](https://arxiv.org/html/2606.14176#bib.bib8); Wang et al\., [2025](https://arxiv.org/html/2606.14176#bib.bib11)\) | 13B | – | 47\.80 | 57\.70 | 360K |
| MathGLM\-Vision\-9B \(Yang et al\., [2024](https://arxiv.org/html/2606.14176#bib.bib9)\) | 9B | – | – | 64\.42 | MathVL \+ VQA |
| Math\-PUMA\-Qwen2 \(Wang et al\., [2025](https://arxiv.org/html/2606.14176#bib.bib11)\) | 7B/8B | – | – | 48\.10 | – |
| MathCoder\-VL\-8B \(Wang et al\., [2025](https://arxiv.org/html/2606.14176#bib.bib11)\) | 8B | – | – | 73\.60 | 8\.6M \+ 3M |
| *SFT\-only geometry data\-generation methods* | | | | | |
| G\-LLaVA\-7B \(Gao et al\., [2025](https://arxiv.org/html/2606.14176#bib.bib3); Deng et al\., [2025](https://arxiv.org/html/2606.14176#bib.bib51)\) | 7B | – | 62\.80 | 53\.40 | 170K |
| G\-LLaVA\-13B \(Gao et al\., [2025](https://arxiv.org/html/2606.14176#bib.bib3); Pan et al\., [2025](https://arxiv.org/html/2606.14176#bib.bib6)\) | 13B | – | 67\.00 | 56\.70 | 170K |
| MAVIS\-7B \(Zhang et al\., [2025a](https://arxiv.org/html/2606.14176#bib.bib7); Pan et al\., [2025](https://arxiv.org/html/2606.14176#bib.bib6)\) | 7B | – | 68\.30 | 64\.10 | 834K |
| TR\-CoT\-8B \(Deng et al\., [2025](https://arxiv.org/html/2606.14176#bib.bib51); Pan et al\., [2025](https://arxiv.org/html/2606.14176#bib.bib6)\) | 8B | – | 75\.90 | 73\.10 | 87K |
| GeoGen\-SFT\-3B \(Pan et al\., [2025](https://arxiv.org/html/2606.14176#bib.bib6)\) | 3B | 44\.70 | 75\.60 | 64\.50 | 224K |
| GeoGen\-SFT\-7B \(Pan et al\., [2025](https://arxiv.org/html/2606.14176#bib.bib6)\) | 7B | 54\.30 | 78\.00 | 74\.00 | 224K |
| TR\-CoT\-Qwen2\.5\-VL\-7B \(Deng et al\., [2025](https://arxiv.org/html/2606.14176#bib.bib51)\) | 7B | – | 79\.20 | –† | 65K \+ Geo170K |
| TR\-CoT\-InternVL2\.5\-8B \(Deng et al\., [2025](https://arxiv.org/html/2606.14176#bib.bib51)\) | 8B | – | 76\.70 | –† | 65K \+ Geo170K |
| VeriGeo | 7B | 59\.40 | 82\.74 | 75\.96 | 8\.7k |

†TR\-CoT reports MathVista results under its own geometry\-problem evaluation setting, but does not explicitly report the exact MathVista\-GPS split used in our evaluation; therefore, we do not copy those numbers into the MathVista\-GPS column\.
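
The size of the improvement over the base model follows directly from the Table 5 entries\. As a small worked check, the snippet below recomputes the absolute gains of the VeriGeo\-tuned model over Qwen2\.5\-VL\-7B\-Instruct and the ratio between the 224K examples listed for GeoGen\-SFT and the 8\.7k used here; the variable names are illustrative, and all numbers are copied from the table rather than re\-measured\.

```python
# Accuracy (%) copied from Table 5.
base = {"PGPS9K": 42.40, "GeoQA": 72.07, "MathVista-GPS": 53.37}
verigeo = {"PGPS9K": 59.40, "GeoQA": 82.74, "MathVista-GPS": 75.96}

# Absolute gain in percentage points on each benchmark.
gains = {name: round(verigeo[name] - base[name], 2) for name in base}

# Data-scale ratio between GeoGen-SFT (224K) and VeriGeo (8.7k).
scale_ratio = round(224_000 / 8_700, 1)

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} points")
print(f"GeoGen-SFT uses about {scale_ratio}x more SFT examples")
```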

## 4 Related Work

We organize prior work into *Diagram\-First Approaches* and *Seed\-based Generation*\.

Diagram\-First Approaches\. Diagram\-first methods treat the geometric configuration as the source of truth and derive text and solutions from that structured state\. Early educational generators enumerate figures, instantiate templates, and validate solvability via automated deduction \(Singhal et al\., [2014](https://arxiv.org/html/2606.14176#bib.bib46)\)\. Recent work strengthens the substrate and verification loop: GeomVerse uses procedural construction for controlled evaluation \(Kazemi et al\., [2023](https://arxiv.org/html/2606.14176#bib.bib47)\), while MAVIS synthesizes diagram–text–reasoning triples at scale \(Zhang et al\., [2025a](https://arxiv.org/html/2606.14176#bib.bib7)\)\. Theorem/constraint\-grounded pipelines further tighten correctness, e\.g\., TR\-CoT and TrustGeoGen \(Deng et al\., [2025](https://arxiv.org/html/2606.14176#bib.bib51); Fu et al\., [2025](https://arxiv.org/html/2606.14176#bib.bib52)\), and unified generators such as GeoUni aim to jointly model diagram, problem text, and solutions under concept control \(Cheng et al\., [2025](https://arxiv.org/html/2606.14176#bib.bib53)\)\. Symbolic\-first solvers are also diagram\-first in spirit: AlphaGeometry operates over a formal language and couples synthetic theorem/proof generation with sound deduction \(Trinh et al\., [2024](https://arxiv.org/html/2606.14176#bib.bib45)\)\. Diagram\-first pipelines provide strong *verifiability*, but their *diversity* and *controllability* are often constrained by engineered primitives and template libraries\.

Seed\-based Generation\. Seed\-based generation expands datasets by rewriting or evolving existing problems, enabling fast linguistic diversification with limited domain engineering\. General recipes include Self\-Instruct and Evol\-Instruct\-style mutation \(Wang et al\., [2023](https://arxiv.org/html/2606.14176#bib.bib22); Xu et al\., [2024](https://arxiv.org/html/2606.14176#bib.bib62)\), with math\-specialized variants such as WizardMath reinforcing multi\-step reasoning \(Luo et al\., [2025](https://arxiv.org/html/2606.14176#bib.bib58)\)\. In multimodal learning, Visual Instruction Tuning \(e\.g\., LLaVA\) similarly scales supervision from image–text seeds \(Liu et al\., [2023](https://arxiv.org/html/2606.14176#bib.bib63)\)\. In geometry, GeoGPT4V and G\-LLaVA build large corpora of diagram\-aligned problems \(Cai et al\., [2024](https://arxiv.org/html/2606.14176#bib.bib49); Gao et al\., [2025](https://arxiv.org/html/2606.14176#bib.bib3)\), while GeoThought emphasizes step\-wise reasoning traces \(and reflection\) \(Shi et al\., [2025](https://arxiv.org/html/2606.14176#bib.bib55)\)\. Seed\-based generation is also used to synthesize *harder evaluation* beyond static benchmarks, as in CHASE \(Patel et al\., [2025](https://arxiv.org/html/2606.14176#bib.bib57)\) and robustness transformations such as MathCheck \(Zhou et al\., [2024](https://arxiv.org/html/2606.14176#bib.bib56)\)\. This paradigm offers convenient *control* via targeted edits, but its distribution is anchored to the seed corpus; moreover, without an executable representation, it has weaker *verifiability* and is vulnerable to hallucinations and silent solvability/text–diagram inconsistencies\.

## 5 Conclusions

We introduced VeriGeo, a controllable framework for generating multimodal geometry problems with verified diagrams and proof\-aligned solutions\. VeriGeo grounds both question generation and solution generation in a shared executable action sequence, allowing natural\-language statements, diagrams, geometric constraints, and reasoning steps to be checked within a unified representation\. Experiments across five LLM backbones show that raw generations are often unreliable, while verification\-guided repair substantially increases the yield of valid problems\. The generated data also exhibits fine\-grained controllability over target difficulty and geometry concepts, while covering a broader range of concepts than prior manually curated, diagram\-first, or seed\-based datasets\. Beyond intrinsic generation quality, supervised fine\-tuning on only 8\.7k verified VeriGeo examples leads to strong performance on standard multimodal geometry benchmarks, including the best reported GeoQA result among end\-to\-end multimodal LLM\-based solvers\.

## References

- S\. Cai, K\. Bao, H\. Guo, J\. Zhang, J\. Song, and B\. Zheng \(2024\) GeoGPT4V: towards geometric multi\-modal large language models with geometric image generation\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing \(EMNLP\), pp\. 750–766\. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.44), [Link](https://aclanthology.org/2024.emnlp-main.44/)\.
- J\. Chen, J\. Tang, J\. Qin, X\. Liang, L\. Liu, E\. Xing, and L\. Lin \(2021\) GeoQA: a geometric question answering benchmark towards multimodal numerical reasoning\. In Findings of the Association for Computational Linguistics: ACL\-IJCNLP 2021, Online, pp\. 513–523\. External Links: [Document](https://dx.doi.org/10.18653/v1/2021.findings-acl.46), [Link](https://aclanthology.org/2021.findings-acl.46/)\.
- J\. Cheng, Z\. Zhang, R\. Chen, J\. Deng, Z\. Qin, and J\. Ma \(2025\) GeoUni: a unified model for generating geometry diagrams, problems and problem solutions\. In Proceedings of the 33rd ACM International Conference on Multimedia, pp\. 3057–3066\. External Links: [Document](https://dx.doi.org/10.1145/3746027.3754965), [Link](https://doi.org/10.1145/3746027.3754965)\.
- L\. de Moura, S\. Kong, J\. Avigad, F\. van Doorn, and J\. von Raumer \(2015\) The Lean theorem prover \(system description\)\. In Automated Deduction – CADE\-25, Lecture Notes in Computer Science, Vol\. 9195, pp\. 378–388\. External Links: [Document](https://dx.doi.org/10.1007/978-3-319-21401-6%5F26)\.
- L\. Deng, L\. Zhu, Y\. Liu, Y\. Wang, Q\. Xie, J\. Wu, G\. Zhang, Y\. Zhu, and X\. Bai \(2025\) Theorem\-validated reverse chain\-of\-thought problem generation for geometric reasoning\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China, pp\. 718–735\. External Links: [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.38), [Link](https://aclanthology.org/2025.emnlp-main.38/)\.
- D\. Fu, Z\. Chen, R\. Xia, Q\. Liu, Y\. Feng, H\. Zhou, R\. Zhang, S\. Feng, P\. Gao, J\. Yan, B\. Shi, B\. Zhang, and Y\. Qiao \(2025\) TrustGeoGen: scalable and formal\-verified data engine for trustworthy multi\-modal geometric problem solving\. External Links: 2504\.15780, [Document](https://dx.doi.org/10.48550/arXiv.2504.15780), [Link](https://arxiv.org/abs/2504.15780)\.
- J\. Gao, R\. Pi, J\. Zhang, J\. Ye, W\. Zhong, Y\. Wang, L\. Hong, J\. Han, H\. Xu, Z\. Li, and L\. Kong \(2025\) G\-LLaVA: solving geometric problem with multi\-modal large language model\. In International Conference on Learning Representations\.
- M\. Kazemi, H\. Alvari, A\. Anand, J\. Wu, X\. Chen, and R\. Soricut \(2023\) GeomVerse: a systematic evaluation of large models for geometric reasoning\. External Links: 2312\.12241, [Document](https://dx.doi.org/10.48550/arXiv.2312.12241), [Link](https://arxiv.org/abs/2312.12241)\.
- H\. Liu, C\. Li, Q\. Wu, and Y\. J\. Lee \(2023\) Visual instruction tuning\. In Advances in Neural Information Processing Systems, A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\), Vol\. 36, pp\. 34892–34916\. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2023/file/6dcf277ea32ce3288914faf369fe6de0-Paper-Conference.pdf)\.
- P\. Lu, H\. Bansal, T\. Xia, J\. Liu, C\. Li, H\. Hajishirzi, H\. Cheng, K\. Chang, M\. Galley, and J\. Gao \(2024\) MathVista: evaluating mathematical reasoning of foundation models in visual contexts\. In International Conference on Learning Representations, B\. Kim, Y\. Yue, S\. Chaudhuri, K\. Fragkiadaki, M\. Khan, and Y\. Sun \(Eds\.\), Vol\. 2024, pp\. 23439–23554\. External Links: [Link](https://proceedings.iclr.cc/paper_files/paper/2024/file/663bce02a0050c4a11f1eb8a7f1429d3-Paper-Conference.pdf)\.
- P\. Lu, R\. Gong, S\. Jiang, L\. Qiu, S\. Huang, X\. Liang, and S\. Zhu \(2021\) Inter\-GPS: interpretable geometry problem solving with formal language and symbolic reasoning\. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\), C\. Zong, F\. Xia, W\. Li, and R\. Navigli \(Eds\.\), Online, pp\. 6774–6786\. External Links: [Document](https://dx.doi.org/10.18653/v1/2021.acl-long.528), [Link](https://aclanthology.org/2021.acl-long.528/)\.
- H\. Luo, Q\. Sun, C\. Xu, P\. Zhao, J\. Lou, C\. Tao, X\. Geng, Q\. Lin, S\. Chen, Y\. Tang, and D\. Zhang \(2025\) WizardMath: empowering mathematical reasoning for large language models via reinforced evol\-instruct\. In International Conference on Learning Representations, Y\. Yue, A\. Garg, N\. Peng, F\. Sha, and R\. Yu \(Eds\.\), Vol\. 2025, pp\. 49573–49609\. External Links: [Link](https://proceedings.iclr.cc/paper_files/paper/2025/file/7c04aea54c2a60a632a47bd451cd2849-Paper-Conference.pdf)\.
- A\. Meurer, C\. P\. Smith, M\. Paprocki, O\. Čertík, S\. B\. Kirpichev, M\. Rocklin, A\. Kumar, S\. Ivanov, J\. K\. Moore, S\. Singh, T\. Rathnayake, S\. Vig, B\. E\. Granger, R\. P\. Muller, F\. Bonazzi, H\. Gupta, S\. Vats, F\. Johansson, F\. Pedregosa, M\. J\. Curry, A\. R\. Terrel, Š\. Roučka, A\. Saboo, I\. Fernando, S\. Kulal, R\. Cimrman, and A\. Scopatz \(2017\) SymPy: symbolic computing in Python\. PeerJ Computer Science 3, pp\. e103\. External Links: [Document](https://dx.doi.org/10.7717/peerj-cs.103), [Link](https://doi.org/10.7717/peerj-cs.103)\.
- Y\. Pan, Z\. Zhang, P\. Hu, J\. Ma, J\. Du, J\. Zhang, Q\. Liu, J\. Gao, and F\. Ma \(2025\) Enhancing the geometric problem\-solving ability of multimodal LLMs via symbolic\-neural integration\. In Proceedings of the 33rd ACM International Conference on Multimedia\. External Links: [Document](https://dx.doi.org/10.1145/3746027.3754571), [Link](https://doi.org/10.1145/3746027.3754571)\.
- A\. Patel, S\. Reddy, and D\. Bahdanau \(2025\) How to get your LLM to generate challenging problems for evaluation\. arXiv preprint arXiv:2502\.14678\. External Links: [Link](https://arxiv.org/abs/2502.14678)\.
- S\. Peng, D\. Fu, L\. Gao, X\. Zhong, H\. Fu, and Z\. Tang \(2024\) MultiMath: bridging visual and mathematical reasoning for large language models\. arXiv preprint arXiv:2409\.00147\. External Links: 2409\.00147, [Link](https://arxiv.org/abs/2409.00147)\.
- B\. Ping, M\. Luo, Z\. Dang, C\. Wang, and C\. Jia \(2026\) AutoGPS: automated geometry problem solving via multimodal formalization and deductive reasoning\. In International Conference on Learning Representations\.
- M\. Seo, H\. Hajishirzi, A\. Farhadi, O\. Etzioni, and C\. Malcolm \(2015\) Solving geometry problems: combining text and diagram interpretation\. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing \(EMNLP\), pp\. 1466–1476\. External Links: [Document](https://dx.doi.org/10.18653/v1/D15-1171), [Link](https://aclanthology.org/D15-1171/)\.
- N\. Shi, C\. Qin, S\. Song, and M\. Luo \(2025\) GeoThought: a dataset for enhancing mathematical geometry reasoning in vision\-language models\. External Links: 2510\.21881, [Link](https://arxiv.org/abs/2510.21881)\.
- W\. Shi, Z\. Hu, Y\. Bin, J\. Liu, Y\. Yang, S\. Ng, L\. Bing, and R\. K\. Lee \(2024\) Math\-LLaVA: bootstrapping mathematical reasoning for multimodal large language models\. arXiv preprint arXiv:2406\.17294\.
- R\. Singhal, M\. Henz, and K\. McGee \(2014\) Automated generation of geometry questions for high school mathematics\. In Proceedings of the 6th International Conference on Computer Supported Education \(CSEDU\), pp\. 14–25\. External Links: [Document](https://dx.doi.org/10.5220/0004795300140025)\.
- T\. H\. Trinh, Y\. Wu, Q\. V\. Le, H\. He, and T\. Luong \(2024\) Solving olympiad geometry without human demonstrations\. Nature 625 \(7995\), pp\. 476–482\. External Links: [Document](https://dx.doi.org/10.1038/s41586-023-06747-5), [Link](https://www.nature.com/articles/s41586-023-06747-5)\.
- K\. Wang, J\. Pan, L\. Wei, A\. Zhou, W\. Shi, Z\. Lu, H\. Xiao, Y\. Yang, H\. Ren, M\. Zhan, and H\. Li \(2025\) MathCoder\-VL: bridging vision and code for enhanced multimodal mathematical reasoning\. In Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, pp\. 2505–2534\.
- Y\. Wang, Y\. Kordi, S\. Mishra, A\. Liu, N\. A\. Smith, D\. Khashabi, and H\. Hajishirzi \(2023\) Self\-Instruct: aligning language models with self\-generated instructions\. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(ACL\), pp\. 13484–13508\. External Links: [Document](https://dx.doi.org/10.18653/v1/2023.acl-long.754), [Link](https://aclanthology.org/2023.acl-long.754/)\.
- E\. B\. Wilson \(1927\) Probable inference, the law of succession, and statistical inference\. Journal of the American Statistical Association 22 \(158\), pp\. 209–212\. External Links: ISSN 01621459, 1537274X, [Link](http://www.jstor.org/stable/2276774)\.
- C\. Xu, Q\. Sun, K\. Zheng, X\. Geng, P\. Zhao, J\. Feng, C\. Tao, Q\. Lin, and D\. Jiang \(2024\) WizardLM: empowering large pre\-trained language models to follow complex instructions\. In International Conference on Learning Representations, B\. Kim, Y\. Yue, S\. Chaudhuri, K\. Fragkiadaki, M\. Khan, and Y\. Sun \(Eds\.\), Vol\. 2024, pp\. 30745–30766\. External Links: [Link](https://proceedings.iclr.cc/paper_files/paper/2024/file/82eec786fdfbbfa53450c5feb7d1ac92-Paper-Conference.pdf)\.
- Z\. Yang, J\. Chen, Z\. Du, W\. Yu, W\. Wang, W\. Hong, Z\. Jiang, B\. Xu, and J\. Tang \(2024\) MathGLM\-Vision: solving mathematical problems with multi\-modal large language model\. arXiv preprint arXiv:2409\.13729\.
- L\. Yu, W\. Jiang, H\. Shi, J\. Yu, Z\. Liu, Y\. Zhang, J\. T\. Kwok, Z\. Li, A\. Weller, and W\. Liu \(2024\) MetaMath: bootstrap your own mathematical questions for large language models\. In International Conference on Learning Representations \(ICLR\)\. External Links: [Link](https://arxiv.org/abs/2309.12284)\.
- J\. Zhang, Z\. Li, M\. Zhang, F\. Yin, C\. Liu, and Y\. Moshfeghi \(2024\) GeoEval: benchmark for evaluating LLMs and multi\-modal models on geometry problem\-solving\. In Findings of the Association for Computational Linguistics: ACL 2024, L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\), Bangkok, Thailand, pp\. 1258–1276\. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.73), [Link](https://aclanthology.org/2024.findings-acl.73/)\.
- M\. Zhang, F\. Yin, and C\. Liu \(2023\) A multi\-modal neural geometric solver with textual clauses parsed from diagram\. In Proceedings of the Thirty\-Second International Joint Conference on Artificial Intelligence\. External Links: [Document](https://dx.doi.org/10.24963/ijcai.2023/376)\.
- R\. Zhang, X\. Wei, D\. Jiang, Z\. Guo, Y\. Zhang, C\. Tong, J\. Liu, A\. Zhou, S\. Zhang, P\. Gao, and H\. Li \(2025a\) MAVIS: mathematical visual instruction tuning with an automatic data engine\. In International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=MnJzJ2gvuf)\.
- X\. Zhang, Y\. Li, N\. Zhu, C\. Qin, Z\. Zeng, and T\. Leng \(2025b\) FGeo\-HyperGNet: geometric problem solving integrating FormalGeo symbolic system and hypergraph neural network\. In Proceedings of the Thirty\-Fourth International Joint Conference on Artificial Intelligence, pp\. 4733–4741\. External Links: [Document](https://dx.doi.org/10.24963/ijcai.2025/527)\.
- Z\. Zhou, S\. Liu, M\. Ning, W\. Liu, J\. Wang, D\. F\. Wong, X\. Huang, Q\. Wang, and K\. Huang \(2024\) Is your model really a good math reasoner? Evaluating mathematical reasoning with checklist\. External Links: 2407\.08733, [Link](https://arxiv.org/abs/2407.08733)\.
- Z\. Zhou, M\. Ning, Q\. Wang, J\. Yao, W\. Wang, X\. Huang, and K\. Huang \(2023\) Learning by analogy: diverse questions generation in math word problem\. In Findings of the Association for Computational Linguistics: ACL 2023, A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\), Toronto, Canada, pp\. 11091–11104\. External Links: [Document](https://dx.doi.org/10.18653/v1/2023.findings-acl.705), [Link](https://aclanthology.org/2023.findings-acl.705/)\.

## Appendix A Limitations and Future Work

VeriGeo provides a controllable and verifiable framework for generating multimodal geometry problems, but several directions remain open for future extension\.

First, the current implementation focuses on geometry problems that can be represented by our executable action grammar and verified through numerical, analytical, and logical checks\. This design gives VeriGeo strong controllability and enables automatic rejection of inconsistent samples, but the supported problem space is still tied to the available geometric operators, constraint types, and algebraic verification routines\. Extending the action grammar to cover richer diagram types, more advanced theorem\-level constructions, three\-dimensional geometry, and curriculum\-specific variants would further broaden the applicability of the framework\.

Second, the generation efficiency of VeriGeo depends on the choice of backbone model and the verification–reflection budget\. Our experiments show that different backbones exhibit different direct\-pass, repair, rejection, time, token, and cost profiles\. This suggests several practical extensions, including adaptive model routing, early rejection of unrecoverable generations, dynamic reflection budgets, and lightweight verifier\-based pre\-filtering before invoking stronger models\.

Finally, our downstream evaluation focuses on supervised fine\-tuning with verified VeriGeo data\. The strong gains obtained from a relatively small verified dataset indicate that verification quality is an important factor for multimodal geometry reasoning\. Future work can further explore process\-level supervision, reinforcement learning, verifier\-guided data selection, and larger\-scale training across different multimodal backbones\. Beyond geometry, the same principle of executable generation followed by multi\-stage verification may also be extended to other structured multimodal reasoning domains, such as physics diagrams, charts, and symbolic visual reasoning\.

### Usage of Large Language Models

Large language models and agentic workflows are part of the proposed method in this paper\. Specifically, VeriGeo uses LLMs as Author and Solver agents for geometry problem generation and solution generation\. LLMs are also used for verification\-guided reflection and for LLM\-as\-a\-judge evaluations, including difficulty classification, pairwise difficulty comparison, geometry\-concept extraction, and knowledge\-similarity scoring\. These uses are core components of the framework and are described in the methodology and experimental setup sections\.

The authors are responsible for all scientific content in the paper, including the method design, implementation, dataset construction, experimental design, result analysis, figures, tables, and references\. All LLM\-generated outputs used in the framework were subject to the verification and evaluation procedures described in the paper\. During manuscript preparation, LLMs were used only for auxiliary editing support, such as grammar checking, stylistic polishing, and improving clarity and fluency\. No LLM or agent was treated as an author\.

## Appendix B Additional Tables and Figures

Table A1: Fine-grained outcome distribution and generation cost (extension of Table [1](https://arxiv.org/html/2606.14176#S3.T1)). For every authoring model we report the percentage of Direct Pass, Accepted after Repair, and Rejected attempts on the $n=50$ problems in each (domain, difficulty) cell. Avg. Tokens, Avg. Time, and Avg. Cost are reported per accepted problem in each cell, where the number of accepted problems is computed as $50 \times (\textit{Direct Pass} + \textit{Repaired})$. Total Cost reports the total API cost incurred in the corresponding cell and sums to the model-level total. The Overall row aggregates the nine cells of each model and reproduces the model-level numbers reported in the main text.

| Model | Domain | Difficulty | Direct Pass (%) | Repaired (%) | Rejected (%) | Avg. Tokens | Avg. Time (s) | Avg. Cost (USD) | Total Cost (USD) |
|---|---|---|---|---|---|---|---|---|---|
| Gemini-3.1-Pro | Euclid. | Easy | 48.0 | 40.0 | 12.0 | 70,908 | 477.6 | 0.3266 | 14.37 |
| | | Medium | 46.0 | 32.0 | 22.0 | 80,295 | 559.8 | 0.4395 | 17.14 |
| | | Hard | 32.0 | 26.0 | 42.0 | 77,074 | 462.1 | 0.5586 | 16.20 |
| | Func. & Calc. | Easy | 70.0 | 28.0 | 2.0 | 47,040 | 318.8 | 0.1735 | 8.50 |
| | | Medium | 64.0 | 34.0 | 2.0 | 50,714 | 342.9 | 0.1963 | 9.62 |
| | | Hard | 64.0 | 34.0 | 2.0 | 48,035 | 340.0 | 0.1814 | 8.89 |
| | Vec. & Coord. | Easy | 56.0 | 44.0 | 0.0 | 60,657 | 399.0 | 0.2432 | 12.16 |
| | | Medium | 56.0 | 42.0 | 2.0 | 69,486 | 569.2 | 0.2994 | 14.67 |
| | | Hard | 52.0 | 44.0 | 4.0 | 70,865 | 613.8 | 0.3167 | 15.20 |
| | Overall | | 54.22 | 36.00 | 9.78 | 62,733 | 449.8 | 0.2876 | 116.75 |
| Qwen3.5-Plus | Euclid. | Easy | 48.0 | 38.0 | 14.0 | 160,465 | 842.3 | 0.0456 | 1.96 |
| | | Medium | 44.0 | 36.0 | 20.0 | 132,197 | 851.4 | 0.0438 | 1.75 |
| | | Hard | 36.0 | 36.0 | 28.0 | 174,101 | 997.1 | 0.0594 | 2.14 |
| | Func. & Calc. | Easy | 28.0 | 26.0 | 46.0 | 179,265 | 730.3 | 0.0674 | 1.82 |
| | | Medium | 20.0 | 20.0 | 60.0 | 194,960 | 772.2 | 0.1000 | 2.00 |
| | | Hard | 22.0 | 20.0 | 58.0 | 208,910 | 847.3 | 0.1038 | 2.18 |
| | Vec. & Coord. | Easy | 58.0 | 30.0 | 12.0 | 156,191 | 858.3 | 0.0425 | 1.87 |
| | | Medium | 46.0 | 36.0 | 18.0 | 155,969 | 887.8 | 0.0468 | 1.92 |
| | | Hard | 42.0 | 34.0 | 24.0 | 172,440 | 876.9 | 0.0532 | 2.02 |
| | Overall | | 38.22 | 30.67 | 31.11 | 165,812 | 860.0 | 0.0570 | 17.66 |
| Claude-Opus-4.6 | Euclid. | Easy | 38.0 | 18.0 | 44.0 | 21,058 | 105.5 | 0.5061 | 14.17 |
| | | Medium | 30.0 | 14.0 | 56.0 | 26,426 | 150.0 | 0.8536 | 18.78 |
| | | Hard | 26.0 | 12.0 | 62.0 | 27,420 | 155.1 | 1.0311 | 19.59 |
| | Func. & Calc. | Easy | 56.0 | 28.0 | 16.0 | 18,094 | 84.4 | 0.2850 | 11.97 |
| | | Medium | 52.0 | 28.0 | 20.0 | 20,953 | 105.1 | 0.3588 | 14.35 |
| | | Hard | 50.0 | 24.0 | 26.0 | 23,192 | 122.5 | 0.4514 | 16.70 |
| | Vec. & Coord. | Easy | 44.0 | 22.0 | 34.0 | 20,672 | 95.4 | 0.3955 | 13.05 |
| | | Medium | 34.0 | 18.0 | 48.0 | 26,204 | 148.2 | 0.7431 | 19.32 |
| | | Hard | 36.0 | 18.0 | 46.0 | 30,152 | 176.2 | 0.8133 | 21.96 |
| | Overall | | 40.67 | 20.22 | 39.11 | 23,087 | 121.3 | 0.5470 | 149.89 |
| GPT-5.4 | Euclid. | Easy | 6.0 | 24.0 | 70.0 | 31,907 | 68.2 | 0.2240 | 3.36 |
| | | Medium | 4.0 | 18.0 | 78.0 | 36,378 | 105.3 | 0.3464 | 3.81 |
| | | Hard | 6.0 | 18.0 | 76.0 | 35,430 | 108.2 | 0.3192 | 3.83 |
| | Func. & Calc. | Easy | 12.0 | 42.0 | 46.0 | 32,205 | 65.0 | 0.1078 | 2.91 |
| | | Medium | 12.0 | 38.0 | 50.0 | 33,696 | 68.6 | 0.1268 | 3.17 |
| | | Hard | 12.0 | 38.0 | 50.0 | 35,812 | 64.0 | 0.1344 | 3.36 |
| | Vec. & Coord. | Easy | 12.0 | 38.0 | 50.0 | 30,535 | 96.0 | 0.1300 | 3.25 |
| | | Medium | 10.0 | 36.0 | 54.0 | 31,034 | 87.4 | 0.1387 | 3.19 |
| | | Hard | 12.0 | 40.0 | 48.0 | 37,181 | 77.3 | 0.1415 | 3.68 |
| | Overall | | 9.56 | 32.44 | 58.00 | 33,624 | 79.2 | 0.1617 | 30.56 |
| GPT-5.4-mini | Euclid. | Easy | 2.0 | 10.0 | 88.0 | 32,709 | 26.8 | 0.1717 | 1.03 |
| | | Medium | 4.0 | 12.0 | 84.0 | 35,527 | 32.6 | 0.1288 | 1.03 |
| | | Hard | 2.0 | 14.0 | 84.0 | 36,492 | 34.2 | 0.1375 | 1.10 |
| | Func. & Calc. | Easy | 2.0 | 4.0 | 94.0 | 42,378 | 34.1 | 0.4033 | 1.21 |
| | | Medium | 2.0 | 8.0 | 90.0 | 38,465 | 33.1 | 0.2260 | 1.13 |
| | | Hard | 0.0 | 2.0 | 98.0 | 32,938 | 23.8 | 0.8000 | 0.80 |
| | Vec. & Coord. | Easy | 4.0 | 18.0 | 78.0 | 32,152 | 30.5 | 0.0855 | 0.94 |
| | | Medium | 2.0 | 10.0 | 88.0 | 31,112 | 27.1 | 0.1550 | 0.93 |
| | | Hard | 4.0 | 8.0 | 88.0 | 31,242 | 28.1 | 0.1600 | 0.96 |
| | Overall | | 2.44 | 9.56 | 88.00 | 34,307 | 30.6 | 0.1691 | 9.13 |
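
The cost columns of Table A1 are tied together by the accepted-problem count given in the caption. As a quick consistency check, the sketch below recomputes Avg. Cost from Total Cost for two cells of the table. The helper name `avg_cost_per_accepted` is ours and is not part of VeriGeo.

```python
def avg_cost_per_accepted(total_cost, direct_pass_pct, repaired_pct, n=50):
    """Average API cost per accepted problem in one (domain, difficulty) cell."""
    # Accepted problems = n * (Direct Pass + Repaired), with rates given in percent.
    accepted = round(n * (direct_pass_pct + repaired_pct) / 100)
    return total_cost / accepted

# Gemini-3.1-Pro, Euclid., Easy: 48% direct pass, 40% repaired, 14.37 USD total.
cost = avg_cost_per_accepted(14.37, 48.0, 40.0)
print(f"{cost:.4f}")  # 0.3266, matching the Avg. Cost column

# GPT-5.4-mini, Func. & Calc., Hard: a single accepted problem carries the whole cell cost.
print(f"{avg_cost_per_accepted(0.80, 0.0, 2.0):.4f}")  # 0.8000
```

Because the denominator is the number of accepted problems, cells with very high rejection rates show a large Avg. Cost even when their Total Cost is small.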

Table A2: Fine-grained failure detection rates (extension of Table [3](https://arxiv.org/html/2606.14176#S3.T3)). For each (model, domain, difficulty) cell, we report the failure detection rate $R_m = N_m / (N_m + N_{\mathrm{acc}})$ for each verification stage $m \in \{\text{Numerical}, \text{Analytical}, \text{Logical}\}$. Here, $N_m$ denotes the number of failure cases first detected by stage $m$, including cases that are later repaired, and $N_{\mathrm{acc}}$ denotes the number of final accepted attempts. The Overall row pools the nine cells of each model and reproduces the model-level rates reported in the main text.

| Model | Domain | Difficulty | $R_{\text{Num.}}$ (%) | $R_{\text{Ana.}}$ (%) | $R_{\text{Log.}}$ (%) |
|---|---|---|---|---|---|
| Gemini-3.1-Pro | Euclid. | Easy | 0.0 | 48.2 | 0.0 |
| | | Medium | 0.0 | 45.8 | 0.0 |
| | | Hard | 0.0 | 48.2 | 0.0 |
| | Func. & Calc. | Easy | 24.6 | 0.0 | 23.4 |
| | | Medium | 19.7 | 0.0 | 31.9 |
| | | Hard | 21.0 | 0.0 | 32.9 |
| | Vec. & Coord. | Easy | 12.3 | 0.0 | 42.5 |
| | | Medium | 14.0 | 2.0 | 41.0 |
| | | Hard | 12.7 | 0.0 | 43.5 |
| | Overall | | 13.4 | 20.1 | 29.5 |
| Qwen3.5-Plus | Euclid. | Easy | 46.2 | 6.5 | 14.0 |
| | | Medium | 49.4 | 4.8 | 18.4 |
| | | Hard | 52.6 | 5.3 | 18.2 |
| | Func. & Calc. | Easy | 62.0 | 3.6 | 12.9 |
| | | Medium | 71.0 | 0.0 | 4.8 |
| | | Hard | 69.1 | 4.5 | 0.0 |
| | Vec. & Coord. | Easy | 46.3 | 2.2 | 12.0 |
| | | Medium | 51.2 | 0.0 | 12.8 |
| | | Hard | 51.3 | 7.3 | 11.6 |
| | Overall | | 54.9 | 4.0 | 12.9 |
| Claude-Opus-4.6 | Euclid. | Easy | 52.5 | 34.9 | 9.7 |
| | | Medium | 57.7 | 47.6 | 0.0 |
| | | Hard | 65.5 | 42.4 | 0.0 |
| | Func. & Calc. | Easy | 48.1 | 0.0 | 20.8 |
| | | Medium | 54.0 | 2.4 | 4.8 |
| | | Hard | 53.2 | 7.5 | 11.9 |
| | Vec. & Coord. | Easy | 52.2 | 21.4 | 10.8 |
| | | Medium | 60.0 | 27.8 | 0.0 |
| | | Hard | 59.1 | 28.9 | 0.0 |
| | Overall | | 55.3 | 23.2 | 8.4 |
| GPT-5.4 | Euclid. | Easy | 58.3 | 61.5 | 25.0 |
| | | Medium | 67.6 | 69.4 | 15.4 |
| | | Hard | 61.3 | 70.0 | 20.0 |
| | Func. & Calc. | Easy | 62.5 | 3.6 | 12.9 |
| | | Medium | 64.3 | 16.7 | 0.0 |
| | | Hard | 66.2 | 0.0 | 3.8 |
| | Vec. & Coord. | Easy | 51.0 | 30.6 | 34.2 |
| | | Medium | 45.2 | 46.5 | 32.4 |
| | | Hard | 55.9 | 35.0 | 10.3 |
| | Overall | | 59.7 | 40.4 | 18.2 |
| GPT-5.4-mini | Euclid. | Easy | 80.0 | 80.6 | 14.3 |
| | | Medium | 69.2 | 80.0 | 0.0 |
| | | Hard | 74.2 | 76.5 | 11.1 |
| | Func. & Calc. | Easy | 93.9 | 57.1 | 0.0 |
| | | Medium | 90.0 | 50.0 | 0.0 |
| | | Hard | 97.7 | 87.5 | 0.0 |
| | Vec. & Coord. | Easy | 69.4 | 67.6 | 15.4 |
| | | Medium | 77.8 | 81.8 | 25.0 |
| | | Hard | 80.0 | 80.6 | 0.0 |
| | Overall | | 83.3 | 76.3 | 10.0 |

Table A3: Comparison with closed-source MLLMs, open-source general MLLMs, open-source math-tuned MLLMs, and SFT-only geometry data-generation methods. All numbers are accuracy (%). “–” indicates that the result or task-specific data scale is not reported under the same benchmark protocol. Data Scale denotes task-specific math/geometry SFT data when available, rather than generic pretraining data. This table is an extension of Table [5](https://arxiv.org/html/2606.14176#S3.T5) of the main paper.

| Model | Params | PGPS9K | GeoQA | MathVista-GPS | Data Scale |
|---|---|---|---|---|---|
| **Closed-source MLLMs** | | | | | |
| Qwen-VL-Plus [[23](https://arxiv.org/html/2606.14176#bib.bib11)] | Closed | – | – | 35.50 | – |
| Qwen-VL-Max [[23](https://arxiv.org/html/2606.14176#bib.bib11)] | Closed | – | – | 46.10 | – |
| GPT-4V [[5](https://arxiv.org/html/2606.14176#bib.bib51), [27](https://arxiv.org/html/2606.14176#bib.bib9)] | Closed | – | 43.40 | 50.50 | – |
| Claude-3-Opus [[23](https://arxiv.org/html/2606.14176#bib.bib11)] | Closed | – | – | 52.90 | – |
| Gemini Ultra [[5](https://arxiv.org/html/2606.14176#bib.bib51)] | Closed | – | – | 56.30 | – |
| GPT-4-Turbo [[23](https://arxiv.org/html/2606.14176#bib.bib11)] | Closed | – | – | 58.30 | – |
| Gemini-1.5-Pro [[23](https://arxiv.org/html/2606.14176#bib.bib11)] | Closed | – | – | 58.90 | – |
| Claude-3.5-Sonnet [[23](https://arxiv.org/html/2606.14176#bib.bib11)] | Closed | – | – | 64.40 | – |
| GPT-4o [[5](https://arxiv.org/html/2606.14176#bib.bib51), [23](https://arxiv.org/html/2606.14176#bib.bib11)] | Closed | – | 61.40 | 64.70 | – |
| **Open-source general MLLMs** | | | | | |
| LLaVA-1.5-13B [[23](https://arxiv.org/html/2606.14176#bib.bib11)] | 13B | – | – | 22.70 | – |
| DeepSeek-VL-7B [[5](https://arxiv.org/html/2606.14176#bib.bib51), [23](https://arxiv.org/html/2606.14176#bib.bib11)] | 7B | – | 33.70 | 28.40 | – |
| Qwen2-VL-7B/8B [[14](https://arxiv.org/html/2606.14176#bib.bib6), [5](https://arxiv.org/html/2606.14176#bib.bib51), [23](https://arxiv.org/html/2606.14176#bib.bib11)] | 7B/8B | 35.00 | 55.70 | 40.90 | – |
| InternVL2-8B [[5](https://arxiv.org/html/2606.14176#bib.bib51), [23](https://arxiv.org/html/2606.14176#bib.bib11)] | 8B | – | 56.50 | 62.00 | – |
| InternVL2-26B [[23](https://arxiv.org/html/2606.14176#bib.bib11)] | 26B | – | – | 54.30 | – |
| InternVL2-76B [[23](https://arxiv.org/html/2606.14176#bib.bib11)] | 76B | – | – | 67.80 | – |
| InternVL2.5-8B [[14](https://arxiv.org/html/2606.14176#bib.bib6), [5](https://arxiv.org/html/2606.14176#bib.bib51)] | 8B | 37.40 | 59.00 | 67.80 | – |
| Qwen2.5-VL-7B-Instruct | 7B | 42.40 | 72.07 | 53.37 | – |
| **Open-source math-tuned MLLMs** | | | | | |
| Math-LLaVA-13B [[20](https://arxiv.org/html/2606.14176#bib.bib8), [23](https://arxiv.org/html/2606.14176#bib.bib11)] | 13B | – | 47.80 | 57.70 | 360K |
| MathGLM-Vision-9B [[27](https://arxiv.org/html/2606.14176#bib.bib9)] | 9B | – | – | 64.42 | MathVL + VQA |
| MultiMath-7B [[16](https://arxiv.org/html/2606.14176#bib.bib10), [23](https://arxiv.org/html/2606.14176#bib.bib11)] | 7B | – | – | 66.80 | 300K‡ |
| Math-PUMA-Qwen2 [[23](https://arxiv.org/html/2606.14176#bib.bib11)] | 7B/8B | – | – | 48.10 | – |
| MathCoder-VL-2B [[23](https://arxiv.org/html/2606.14176#bib.bib11)] | 2B | – | – | 66.40 | 8.6M + 3M |
| MathCoder-VL-8B [[23](https://arxiv.org/html/2606.14176#bib.bib11)] | 8B | – | – | 73.60 | 8.6M + 3M |
| **SFT-only geometry data-generation methods** | | | | | |
| G-LLaVA-7B [[7](https://arxiv.org/html/2606.14176#bib.bib3), [5](https://arxiv.org/html/2606.14176#bib.bib51)] | 7B | – | 62.80 | 53.40 | 170K |
| G-LLaVA-13B [[7](https://arxiv.org/html/2606.14176#bib.bib3), [14](https://arxiv.org/html/2606.14176#bib.bib6)] | 13B | – | 67.00 | 56.70 | 170K |
| MAVIS-7B [[31](https://arxiv.org/html/2606.14176#bib.bib7), [14](https://arxiv.org/html/2606.14176#bib.bib6)] | 7B | – | 68.30 | 64.10 | 834K |
| TR-CoT-8B [[5](https://arxiv.org/html/2606.14176#bib.bib51), [14](https://arxiv.org/html/2606.14176#bib.bib6)] | 8B | – | 75.90 | 73.10 | 87K |
| GeoGen-SFT-3B [[14](https://arxiv.org/html/2606.14176#bib.bib6)] | 3B | 44.70 | 75.60 | 64.50 | 224K |
| GeoGen-SFT-7B [[14](https://arxiv.org/html/2606.14176#bib.bib6)] | 7B | 54.30 | 78.00 | 74.00 | 224K |
| TR-CoT-Qwen2.5-VL-7B [[5](https://arxiv.org/html/2606.14176#bib.bib51)] | 7B | – | 79.20 | –† | 65K + Geo170K |
| TR-CoT-InternVL2.5-8B [[5](https://arxiv.org/html/2606.14176#bib.bib51)] | 8B | – | 76.70 | –† | 65K + Geo170K |
| VeriGeo-greedy | 7B | 56.90 | 79.31 | 75.00 | 8.7k |
| VeriGeo-beam5 | 7B | 59.40 | 82.74 | 75.96 | 8.7k |

† TR-CoT reports MathVista results under its own geometry-problem evaluation setting, but does not explicitly report the exact MathVista-GPS split used in our evaluation; therefore, we do not copy those numbers into the MathVista-GPS column.

‡ MultiMath includes a process-supervised RL stage and is listed only as a math-tuned contextual baseline, not as a directly comparable SFT-only geometry data-generation method.

⊳ Qwen2.5-VL-7B-Instruct fine-tuned via LoRA ($r=4$, $\alpha=8$) on 8673 synthetic geometry problems for 1 epoch (205 optimizer steps), AdamW with learning rate 1e-5, cosine schedule, 200 warmup steps, effective batch size 48, bf16. Inference with beam search (num_beams=5, top-1).

![Refer to caption](https://arxiv.org/html/2606.14176v1/figs/concept_drift_new.png)

Figure A1: Distribution shift of complexity under difficulty control.

![Refer to caption](https://arxiv.org/html/2606.14176v1/x2.png)

Figure A2: Case study of difficulty modulation: from a one-step triangle angle-sum seed to a multi-constraint angle calculation problem. In this case, the seed problem is a one-step angle calculation task: with two angles in a single triangle given, the target follows directly from the triangle angle-sum rule. In contrast, the difficulty-increased output preserves the same angle calculation theme but embeds it in a richer configuration: two triangles share vertex $C$, with additional collinearity constraints ($B, C, D, F$ and $A, C, E$ are collinear). Solving now requires multiple intermediate inferences. This adds more geometry concepts and a longer reasoning chain while staying aligned with the seed’s geometric context.

## Appendix C VeriGeo Language Specification

This appendix provides the full grammar, supported operator types, and the stage\-specific profiles used by the Author and Solver\.

### C.1 Core data model and identifiers

VeriGeo operates over a shared, mutable geometric state containing typed entities (e.g., `Point`, `Segment`, `Circle`, `Angle`, `Triangle`, `Quadrilateral`, `Function`) and constraint edges. Each entity is referenced by a string identifier.

#### Canonical IDs and aliases\.

VeriGeo accepts both canonical IDs and common aliases, then normalizes them before execution\. In particular:

- Points are atomic IDs (e.g., `A`, `P`, `O`).
- Segments are canonicalized as `SEG_AB` with endpoints in sorted order; common aliases such as `AB` are accepted and normalized.
- Triangles and quadrilaterals are referenced by explicit registry IDs (e.g., `TRI_ABC`, `RECT_ABCD`) created via registry operators.
- Angles are referenced either by a registered angle ID (e.g., `ANG_BAC`) or by a readable alias (e.g., `∠BAC`) that is normalized to the registered entity.
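
To make the segment rule concrete, here is a minimal sketch of alias normalization. The function name `canonical_segment_id`, the restriction to single uppercase-letter point IDs, and the rejection of degenerate aliases such as `AA` are our assumptions for illustration; the actual normalizer is not shown in the paper.

```python
import re

def canonical_segment_id(token):
    """Map 'AB', 'BA', or 'SEG_BA' to the canonical 'SEG_AB' form."""
    name = token[4:] if token.startswith("SEG_") else token
    # Assumption: each endpoint is a single uppercase-letter point ID.
    if not re.fullmatch(r"[A-Z]{2}", name) or name[0] == name[1]:
        raise ValueError(f"not a valid segment alias: {token!r}")
    # Sorting the endpoints makes 'AB' and 'BA' resolve to the same entity.
    return "SEG_" + "".join(sorted(name))

print(canonical_segment_id("BA"))      # SEG_AB
print(canonical_segment_id("SEG_DC"))  # SEG_CD
```

Sorting endpoints means two actions that name the same segment in opposite directions always refer to one registry entry.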

### C.2 Action grammar and JSON envelopes

#### Atomic action\.

Every step is a JSON record

$$a_t \equiv (\texttt{op}, \texttt{type}, \texttt{args}),$$

where `op` selects an operator family, `type` selects a variant, and `args` supplies ordered arguments (IDs, literals, or expressions). All numerical values are provided as *strings* to ensure determinism across LLM serialization and downstream parsing.

#### EBNF \(trace\-level\)\.

```
Action   ::= {"op": Op, "type": Type, "args": [Arg*]}
Arg      ::= string
Actions  ::= [Action*]

AuthorEnvelope ::= {
  "problem": string,
  "givens": [string*],
  "goal": string,
  "answer": string,
  "comment": string,
  "actions": Actions,
  optional "solver_actions": Actions
}

SolverEnvelope ::= {
  "actions": Actions,
  "final_answer": string,
  optional "proof": string
}
```
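
As an illustration of the grammar, the sketch below checks a parsed JSON object against the `Action` and `SolverEnvelope` productions. The helper names and the decision to reject unknown keys are our assumptions; the paper enforces the format through JSON-schema response formats rather than this code.

```python
import json

ACTION_KEYS = {"op", "type", "args"}

def is_action(obj):
    """True if obj matches Action ::= {"op", "type", "args": [string*]}."""
    return (
        isinstance(obj, dict)
        and set(obj) == ACTION_KEYS
        and isinstance(obj["op"], str)
        and isinstance(obj["type"], str)
        and isinstance(obj["args"], list)
        and all(isinstance(a, str) for a in obj["args"])  # Arg ::= string
    )

def is_solver_envelope(obj):
    """True if obj matches SolverEnvelope (the "proof" field is optional)."""
    if not isinstance(obj, dict):
        return False
    required = {"actions", "final_answer"}
    if not required <= set(obj) <= required | {"proof"}:
        return False
    return (
        isinstance(obj["actions"], list)
        and all(is_action(a) for a in obj["actions"])
        and isinstance(obj["final_answer"], str)
        and isinstance(obj.get("proof", ""), str)
    )

raw = """{
  "actions": [
    {"op": "AddEdge", "type": "Perpendicular", "args": ["SEG_AB", "SEG_CD"]}
  ],
  "final_answer": "90"
}"""
print(is_solver_envelope(json.loads(raw)))  # True

# Bare numbers violate Arg ::= string, so this action is rejected.
print(is_action({"op": "AddPoint", "type": "Cartesian", "args": ["A", 0, 0]}))  # False
```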

### C.3 Stage-specific profiles

Both stages share the *same* core semantics, but expose different operator subsets in practice.

#### Author profile (generative).

Used to *construct* a non-degenerate configuration and a consistent problem package. In addition to introducing entities and constraints, the Author may perform controlled coordinate adjustments (e.g., `MovePoint`) to improve realizability and avoid degeneracy.

#### Solver profile (verificative).

Used to record a *proof-aligned* reasoning trace: derived relations (e.g., `ExprConstraint`, `Midpoint`, `Perpendicular`) plus explicit verification operators (e.g., `VerifyPoint`, `VerifyFunction`). The Solver trace is executed stepwise; invalid IDs, missing prerequisites, or failing predicates are rejected immediately.

### C.4 Numerical and expression sublanguage

Many operator arguments are `ScalarExpr` strings. The implementation uses a controlled parser that (i) preserves exact intent when possible (e.g., `"1/2"`, `"sqrt(2)"`, `"√2"`) and (ii) rejects unsafe constructs when evaluating user/LLM-provided function expressions.

#### ScalarExpr.

Supported literals include signed integers/decimals, rationals, radicals, and simple arithmetic compositions (e.g., `"1/2 + sqrt(2)"`). This design avoids float-format ambiguity in JSON and keeps traces reproducible.
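
The float-format ambiguity is easy to reproduce. The sketch below uses Python's standard `fractions.Fraction` as a stand-in for an exact parser; it covers only the integer, decimal, and rational literals, not radicals, and it is not the VeriGeo parser itself.

```python
from fractions import Fraction

# Binary floats cannot represent 0.1, 0.2, or 0.3 exactly.
print(0.1 + 0.2 == 0.3)  # False

# String literals parsed as exact rationals keep the intended values.
a, b, c = Fraction("0.1"), Fraction("0.2"), Fraction("0.3")
print(a + b == c)  # True

# Rational and decimal spellings of the same quantity agree after parsing.
print(Fraction("1/2") == Fraction("0.5"))  # True
print(Fraction("-2/3"))  # -2/3
```

Passing values as strings leaves this choice to the parser instead of to whichever JSON library serializes the trace.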

#### Function expressions (for `AddFunction`/`VerifyFunction`).

When a function is specified as `"y = f(x)"` or as an expression, it is parsed via a restricted AST: only a whitelist of arithmetic operators and elementary functions is permitted; attribute access and other unsafe nodes are disallowed. This makes function-based verification both robust and safe under untrusted generation.
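
A restricted-AST evaluator of this kind can be sketched with Python's standard `ast` module. The whitelist below (five binary operators, unary sign, and a handful of `math` functions) and the name `safe_eval` are our assumptions; the paper does not list the exact set it permits.

```python
import ast
import math
import operator

BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
           ast.Div: operator.truediv, ast.Pow: operator.pow}
UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}
FUNCS = {"sqrt": math.sqrt, "sin": math.sin, "cos": math.cos,
         "exp": math.exp, "log": math.log}

def safe_eval(expr, x):
    """Evaluate an expression in x, allowing only whitelisted AST nodes."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.Name) and node.id == "x":
            return x
        if isinstance(node, ast.BinOp) and type(node.op) in BIN_OPS:
            return BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in UNARY_OPS:
            return UNARY_OPS[type(node.op)](walk(node.operand))
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in FUNCS and not node.keywords):
            return FUNCS[node.func.id](*[walk(a) for a in node.args])
        # Attribute access, subscripts, lambdas, and unknown names end up here.
        raise ValueError(f"disallowed syntax: {type(node).__name__}")

    # Accept "y = f(x)" by keeping only the right-hand side.
    rhs = expr.split("=", 1)[1] if "=" in expr else expr
    return walk(ast.parse(rhs.strip(), mode="eval"))

print(safe_eval("y = x**2 + 1", 3))  # 10
```

The evaluator never calls `eval`; anything outside the whitelist, such as `x.__class__`, raises `ValueError` before any code runs.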

## Appendix D Supplementary Details for the Action Catalog

A summary of the operations is listed in Table [A4](https://arxiv.org/html/2606.14176#A4.T4). Details of each operator family, the supported `type` variants, argument signatures, and effects are summarized below. Unless stated otherwise, all IDs referenced in `args` must already exist (after normalization). This appendix also includes the (formerly separate) stage-profile comparison, consolidated under the same catalog to avoid redundancy while keeping all original labels intact.

| Family | Representative operators (examples) |
|---|---|
| Point construction | `AddPoint` (Cartesian/OnSegment/Free) |
| Primitive geometry | `AddAuxLine` (Segment), `AddCircle` (CenterRadius/CenterPoint) |
| Object registry | `AddAngle`, `AddTriangle`, `AddQuadrilateral` (Rectangle/Square/…) |
| Constraints | `AddEdge` (Perpendicular/EqualAngle/ExprConstraint/…) |
| Function modeling | `AddFunction` (Explicit), `VerifyFunction` (DerivativeAt/IntegralOn) |
| Verification | `VerifyPoint` (Cartesian) |

Table A4: Operator families in VeriGeo

#### AddPoint (point construction).

- `Free`: `["P"]`. Create a new free point.
- `Cartesian` (Author only): `["P","x","y"]` with `x`, `y`: `ScalarExpr`. Create a point with explicit coordinates.
- `OnSegment`: `["P","SEG_AB","t"]`. Create a point incident to an existing segment; optional seed ratio `t` (clamped to $[0,1]$) initializes the embedding, while incidence is enforced by constraints.
- `OnCircle`: `["P","CIR_O","theta"]`. Create a point on an existing circle; optional seed angle `theta` (degrees).
- `PointOnCircle`: `["P","CIR_O","Q"]`. Create a point on the circle using an existing reference point `Q` to define a stable initial direction.

#### MovePoint (Author-only coordinate adjustment).

`Cartesian`: `["P","x","y"]`. Update the embedding coordinates of an existing point (used to repair degeneracy or improve layout); constraints remain authoritative and will be re-certified globally.

#### AddAuxLine (auxiliary segment).

`Segment`: `["A","B"]`. Register (or reuse) the segment between two existing points, creating the canonical ID `SEG_AB` after normalization.

#### AddCircle (primitive geometry).

- `CenterRadius`: `["CIR_O","O","r"]`. Create a circle by center and radius.
- `CenterPoint`: `["CIR_O","O","A"]`. Create a circle by center and a point on the circle (radius derived from `OA`).

#### Registry operators \(explicit object creation\)\.

These operators make composite objects first\-class, enabling downstream constraints and expressions to reference them by ID\.

- `AddAngle` (`ByPoints`): `["ANG","B","A","C"]` registers $\angle BAC$ under ID `ANG`.
- `AddTriangle` (`ByPoints`): `["TRI","A","B","C"]` registers triangle $ABC$ under ID `TRI`.
- `AddQuadrilateral`: `["QID","A","B","C","D"]` with one of the following `type` variants: `Trapezoid`, `Parallelogram`, `Rectangle`, `Square`, `Rhombus` (and other supported subtypes). The registry operator also ensures boundary segments (`AB`, `BC`, `CD`, `DA`) exist and adds the defining constraints implied by the type.

#### AddFunction (function modeling).

`Explicit`: `["FUNC_f", "y = f(x)", "x_min", "x_max", "samples"]`. Registers an explicit real function (parsed from the right-hand side if `"y="` is provided) and an optional sampling range used for rendering and numerical checks.

#### AddEdge (constraint edges).

`AddEdge` registers a typed constraint edge over an existing scope. The backend supports a broad family of constraints; in *strict* mode, a selected subset is compiled into equation systems (Appendix [E.1](https://arxiv.org/html/2606.14176#A5.SS1)), while the remaining relations are validated by certified numerical checks and semantic passes (Appendices [F](https://arxiv.org/html/2606.14176#A6) and [E.4](https://arxiv.org/html/2606.14176#A5.SS4)). Common `type` variants include:

- Segment relations (`SEG_..` or point-pair forms are accepted and normalized): `Perpendicular`, `Parallel`, `EqualLength`. Signature: `[seg1, seg2]`.
- Angle relations: `EqualAngle`. Signature: `[ang1, ang2]`.
- Incidence / collinearity / locus: `Collinear` (`[P1,P2,P3,...]`), `PointOnLine` (`[P, SEG_AB]`), `PointOnSegment` (`[P, SEG_AB, t]`), `OnCircle` (`[P, CIR_O]`). Internally, these are normalized into a uniform incidence representation with optional parameters (e.g., segment ratio).
- Midpoint / angle bisector: `Midpoint` (`[M, SEG_AB]` or `[M, A, B]`), `AngleBisector` (`[SEG_AE, ANG_BAC]` and common shorthands that normalize to this form).
- Cyclic / concyclic: `Cyclic` (`[A,B,C,D]`) is supported as an alias that is normalized to concyclicity checks and/or circle-incidence constraints depending on whether an explicit circle object exists.
- Triangle relations: `SimilarTriangles`, `CongruentTriangles`. Signature: `[TRI_ABC, TRI_DEF]`.
- Vector constraints (coordinate geometry):
  - `VectorSum`: `[A,B, C,D, E,F]` encoding $\overrightarrow{AB} = \overrightarrow{CD} + \overrightarrow{EF}$.
  - `VectorDiff`: `[A,B, C,D, E,F]` encoding $\overrightarrow{AB} = \overrightarrow{CD} - \overrightarrow{EF}$.
  - `VectorDotZero`: `[A,B, C,D]` encoding $\overrightarrow{AB} \cdot \overrightarrow{CD} = 0$.
  - `VectorLinear`: `[A,B, k1,C1,D1, k2,C2,D2, ...]` encoding $\overrightarrow{AB} = \sum_{i} k_{i}\,\overrightarrow{C_{i}D_{i}}$.
- Algebraic constraints: `ExprConstraint`. Signature: `[lhs, rel, rhs, tol]`, where `rel` is one of `=`, `==`, `!=`, `<=`, `>=`, `<`, `>` and `tol` is an optional `ScalarExpr`.
- Advanced relations (validated by certified checks when present): `Tangent`, `PowerEquality`, and other higher-level theorem-backed relations used by the Solver trace and/or derived-constraint closure.
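
As a rough illustration of how an `ExprConstraint` could be checked once `lhs` and `rhs` have been evaluated to numbers, consider the sketch below. How the tolerance applies to `!=` and to the non-strict inequalities, and the default of `1e-9`, are our assumptions; the paper only states that `tol` is optional.

```python
def check_expr_constraint(lhs, rel, rhs, tol=1e-9):
    """Check `lhs rel rhs` on already-evaluated numbers, with tolerance tol."""
    diff = lhs - rhs
    if rel in ("=", "=="):  # both spellings denote equality
        return abs(diff) <= tol
    if rel == "!=":
        return abs(diff) > tol
    if rel == "<=":
        return diff <= tol
    if rel == ">=":
        return diff >= -tol
    if rel == "<":
        return lhs < rhs
    if rel == ">":
        return lhs > rhs
    raise ValueError(f"unknown relation: {rel!r}")

# Pythagorean check for a 3-4-5 right triangle: AB^2 + BC^2 = AC^2.
print(check_expr_constraint(3**2 + 4**2, "=", 5**2))  # True
# The tolerance absorbs floating-point noise in an equality.
print(check_expr_constraint(0.1 + 0.2, "==", 0.3))  # True
```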

#### VerifyPoint (explicit coordinate check; Solver profile).

`Cartesian`: `["P","x","y","tol"]`. Checks that the realized point coordinates match the claimed values within tolerance.

#### VerifyFunction (function reasoning; Solver profile).

- `PointsOnFunction`: `["f", P1, P2, ..., "tol"]` where each `Pi` may be a point ID or a coordinate pair `"(x,y)"`; verifies pointwise satisfaction of $y=f(x)$.
- `DerivativeAt`: `["f","x0","df","n=order","tol"]`. Numerically checks an $n$-th derivative at `x0` (or at a point ID via its `x`-coordinate).
- `IntegralOn`: `["f","a","b","val","tol"]`. Numerically checks $\int_{a}^{b} f(x)\,dx$; bounds may be numeric, point IDs (using their `x`-coordinate), or infinities (`"inf"`, `"-inf"`), with optional breakpoints for piecewise evaluation.
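
A `DerivativeAt` check with `n=1` can be pictured as a finite-difference comparison. The sketch below uses a central difference with a fixed step; the step size, the helper names, and the test function $f(x)=x^2$ are our choices, and the paper does not say which numerical scheme VeriGeo uses.

```python
def derivative_at(f, x0, h=1e-4):
    """First derivative of f at x0 by a central difference."""
    return (f(x0 + h) - f(x0 - h)) / (2 * h)

def verify_derivative_at(f, x0, claimed, tol):
    """Mimic a DerivativeAt check for n=1: |f'(x0) - claimed| <= tol."""
    return abs(derivative_at(f, x0) - claimed) <= tol

def f(x):
    return x**2

# Analogous to args ["FUNC_f", "x=1", "2", "n=1", "1e-4"] with f(x) = x^2.
print(verify_derivative_at(f, 1.0, 2.0, 1e-4))  # True
print(verify_derivative_at(f, 1.0, 3.0, 1e-4))  # False
```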

### D.1 Comparison of Actions/Operators used in Author and Solver Agents

#### Stage\-specific profiles in code \(operator\-level separation\)\.

The implementation enforces *distinct allowed operator sets* for Author vs. Solver traces, while keeping a shared execution core:

- Author (generative) trace permits diagram construction and controlled layout repair: $\texttt{AllowedOps}_{\textsf{Author}} = \{\texttt{AddPoint, MovePoint, AddAuxLine, AddEdge, AddCircle, AddAngle, AddTriangle, AddQuadrilateral, AddFunction}\}$.
- Solver (verificative) trace is restricted to proof-aligned, checkable steps: $\texttt{AllowedOps}_{\textsf{Solver}} = \{\texttt{AddEdge, AddPoint, AddAuxLine, AddFunction, VerifyPoint, VerifyFunction}\}$.

This separation is *enforced twice*: (i) at LLM generation time via strict JSON-schema response formats, and (ii) at runtime by a validator that rejects out-of-profile operators, malformed signatures, or unresolved IDs.
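
The runtime half of this check can be sketched as a simple set-membership test over the two operator sets listed above. The function name `check_profile` and the error format are hypothetical; signature and ID resolution checks are omitted for brevity.

```python
ALLOWED_OPS = {
    "Author": {"AddPoint", "MovePoint", "AddAuxLine", "AddEdge", "AddCircle",
               "AddAngle", "AddTriangle", "AddQuadrilateral", "AddFunction"},
    "Solver": {"AddEdge", "AddPoint", "AddAuxLine", "AddFunction",
               "VerifyPoint", "VerifyFunction"},
}

def check_profile(actions, profile):
    """Return a list of (index, reason) for actions the profile rejects."""
    allowed = ALLOWED_OPS[profile]
    errors = []
    for i, action in enumerate(actions):
        if set(action) != {"op", "type", "args"}:
            errors.append((i, "malformed action record"))
        elif action["op"] not in allowed:
            errors.append((i, f"{action['op']} not allowed in {profile} profile"))
    return errors

trace = [
    {"op": "AddEdge", "type": "Perpendicular", "args": ["SEG_AB", "SEG_CD"]},
    {"op": "MovePoint", "type": "Cartesian", "args": ["A", "1", "2"]},
]
print(check_profile(trace, "Author"))  # []
print(check_profile(trace, "Solver"))
# [(1, 'MovePoint not allowed in Solver profile')]
```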

#### Normalization and robustness under LLM noise\.

To make traces stable under minor formatting variation, the parser performs conservative normalizations without changing semantics:

- Canonicalizes segment/angle IDs (`AB` → `SEG_AB`; readable angle tokens → registered vertex triplets).
- Sanitizes argument typing: numerical fields are accepted as strings/ints/floats but are stored and processed as strings, ensuring deterministic downstream parsing.
- Repairs common shorthands (e.g., `AddAuxLine` that redundantly includes a segment name in `args`) and then revalidates the normalized action.

Crucially, these repairs are *not* heuristic acceptance of invalid steps: if a normalization cannot map the action into a valid canonical form, the step is rejected.

#### Proof–action alignment as a verifiability constraint\.

The Solver prompt and validator jointly enforce that every non-trivial numerical or algebraic conclusion in the proof must be emitted as an explicit `ExprConstraint` step. This makes the reasoning trace executable, auditable, and *replayable* (rather than a free-form narrative), which is essential for reliable LLM supervision and for reviewer-facing inspection.

#### Conservative derived-constraint closure (optional but impactful).

Beyond user-provided constraints, the backend optionally applies a provenance-tagged closure pass (`source=derived`) that adds *safe* implied relations when prerequisites are already certified, e.g., inferring `SimilarTriangles` from two parallel-induced angle correspondences, or from numerically certified AA matches. Because every derived edge is guarded by existence checks and tagged with its reason, this augmentation strengthens downstream verification without introducing silent assumptions.

#### Minimal examples \(illustrative only\)\.

> `{"op":"AddPoint","type":"Cartesian","args":["A","0","0"]}` (Author)
>
> `{"op":"AddEdge","type":"Perpendicular","args":["SEG_AB","SEG_CD"]}` (Both)
>
> `{"op":"VerifyFunction","type":"DerivativeAt","args":["FUNC_f","x=1","2","n=1","1e-4"]}` (Solver)
>
> `{"op":"AddQuadrilateral","type":"Square","args":["SQ_ABCD","A","B","C","D"]}` (Author)

### D.2 Human-Readable Exact Quantities

A persistent challenge in LLM-based geometry synthesis is the instability of floating-point numbers: a pipeline may be logically correct yet fail verification because calculation artifacts perturb downstream checks or yield non-canonical answers. To address this, we canonicalize numerical quantities into human-readable, verifiable forms: (i) integers/decimals (e.g., `"3"`, `"0.5"`); (ii) rationals (e.g., `"-2/3"`); (iii) radicals (e.g., `"\sqrt(2)"`); and (iv) simple arithmetic over constants (e.g., `"1/2 + sqrt(2)"`). This approach mitigates precision errors introduced by floating-point arithmetic and makes the output more user-friendly.

## Appendix E Supplementary Information of Analytical Verification

### E.1 Analytical Constraints

Let each point $P$ have unknowns $(x_{P}, y_{P})$ unless already `known` or closed-form locked. Our analytical solver compiles the following constraint types into equations:

#### Collinear.

For each triple $(A,B,C)$ in scope:

$$(x_{B}-x_{A})(y_{C}-y_{A}) - (y_{B}-y_{A})(x_{C}-x_{A}) = 0.$$

#### Parallel.

For segments $\overline{AB}$ and $\overline{CD}$:

$$(x_{B}-x_{A})(y_{D}-y_{C}) - (y_{B}-y_{A})(x_{D}-x_{C}) = 0.$$

#### Perpendicular.

$$(x_{B}-x_{A})(x_{D}-x_{C}) + (y_{B}-y_{A})(y_{D}-y_{C}) = 0.$$

#### EqualLength.

$$\|B-A\|^{2} - \|D-C\|^{2} = 0.$$
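
The Collinear, Parallel, Perpendicular, and EqualLength equations are all cross or dot products of difference vectors, so their residuals can be written in a few lines. The function names below are ours; the sketch evaluates residuals at concrete coordinates, whereas the analytical solver treats the coordinates as unknowns.

```python
def sub(p, q):
    return (p[0] - q[0], p[1] - q[1])

def cross(u, v):
    return u[0] * v[1] - u[1] * v[0]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def collinear(a, b, c):
    return cross(sub(b, a), sub(c, a))

def parallel(a, b, c, d):
    return cross(sub(b, a), sub(d, c))

def perpendicular(a, b, c, d):
    return dot(sub(b, a), sub(d, c))

def equal_length(a, b, c, d):
    return dot(sub(b, a), sub(b, a)) - dot(sub(d, c), sub(d, c))

# Unit square ABCD: every residual that should vanish is exactly 0.
A, B, C, D = (0, 0), (1, 0), (1, 1), (0, 1)
print(parallel(A, B, D, C))       # 0  (AB is parallel to DC)
print(perpendicular(A, B, B, C))  # 0  (AB is perpendicular to BC)
print(equal_length(A, B, B, C))   # 0  (|AB| = |BC|)
print(collinear(A, B, C))         # 1  (A, B, C are not collinear)
```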

#### EqualAngle.

Given angle tokens resolvable into vertex triplets $(A,B,C)$ and $(D,E,F)$, the engine attempts the signed-cosine form:

$$\frac{(A-B)\cdot(C-B)}{\|A-B\|\,\|C-B\|} - \frac{(D-E)\cdot(F-E)}{\|D-E\|\,\|F-E\|} = 0,$$

and otherwise falls back to a polynomial equality derived from dot products and squared lengths:

$$\bigl((A-B)\cdot(C-B)\bigr)^{2}\,\|D-E\|^{2}\,\|F-E\|^{2} - \bigl((D-E)\cdot(F-E)\bigr)^{2}\,\|A-B\|^{2}\,\|C-B\|^{2} = 0.$$
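
The polynomial form follows from the cosine form by moving one term across, squaring, and clearing denominators. Squaring discards the sign of the cosine, so, taken on its own, the polynomial residual also vanishes for supplementary angles. The sketch below illustrates this with a 45° and a 135° angle; the function names are ours, and whether the engine adds a separate sign check alongside the fallback is not stated in the text.

```python
import math

def sub(p, q):
    return (p[0] - q[0], p[1] - q[1])

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def cos_residual(a, b, c, d, e, f):
    """cos(angle ABC) - cos(angle DEF), with vertices at b and e."""
    u, v = sub(a, b), sub(c, b)
    p, q = sub(d, e), sub(f, e)
    return (dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))
            - dot(p, q) / math.sqrt(dot(p, p) * dot(q, q)))

def poly_residual(a, b, c, d, e, f):
    """Squared, denominator-free form of the same equality."""
    u, v = sub(a, b), sub(c, b)
    p, q = sub(d, e), sub(f, e)
    return (dot(u, v) ** 2 * dot(p, p) * dot(q, q)
            - dot(p, q) ** 2 * dot(u, u) * dot(v, v))

O = (0, 0)
# Two 45-degree angles: both residuals vanish (the cosine one up to rounding).
print(abs(cos_residual((1, 1), O, (1, 0), (2, 2), O, (3, 0))) < 1e-12)  # True
print(poly_residual((1, 1), O, (1, 0), (2, 2), O, (3, 0)))              # 0

# 45 degrees versus 135 degrees: only the cosine form tells them apart.
print(round(cos_residual((1, 1), O, (1, 0), (-1, 1), O, (1, 0)), 3))  # 1.414
print(poly_residual((1, 1), O, (1, 0), (-1, 1), O, (1, 0)))           # 0
```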

#### AngleBisector.

For an angle token $(X,Y,Z)$ and a bisector segment that shares vertex $Y$, let the other endpoint be $E$; the engine enforces:

$$\cos\angle XYE - \cos\angle EYZ = 0$$

(using the same cosine fallback rules as `EqualAngle`).

#### VectorDotZero.

For $\overrightarrow{AB}\cdot\overrightarrow{CD} = 0$:

$$(x_{B}-x_{A})(x_{D}-x_{C}) + (y_{B}-y_{A})(y_{D}-y_{C}) = 0.$$

#### Midpoint.

For midpoint $M$ of $AB$:

$$2x_{M} - x_{A} - x_{B} = 0, \quad 2y_{M} - y_{A} - y_{B} = 0.$$

#### PointOnLine.

For point $P$ on line/segment with reference endpoints $A, B$:

$$(x_{P}-x_{A})(y_{B}-y_{A}) - (y_{P}-y_{A})(x_{B}-x_{A}) = 0.$$

#### PointOnSegment (ratio-locked or free).

If ratio $r$ is provided:

$$x_{P} - x_{A} - r(x_{B}-x_{A}) = 0, \quad y_{P} - y_{A} - r(y_{B}-y_{A}) = 0.$$

Otherwise, a free scalar parameter $t$ is introduced and optimized jointly with other variables under the same affine constraints.

#### Incidence.

If target is a circle, it reduces to `OnCircle`. If target is a line/segment, it reduces to `PointOnLine`.

#### OnCircle / PointOnCircle.

Let circle center be $O$; the engine uses $r^{2}$ as: (i) a constant if numerical radius exists; else (ii) $\|Q-O\|^{2}$ if a through-point $Q$ exists; else (iii) a fresh symbolic parameter. Then it enforces $(x_{P}-x_{O})^{2} + (y_{P}-y_{O})^{2} - r^{2} = 0$.

#### Concyclic.

If the scope is of the form $(P, \mathsf{CIR})$ where $\mathsf{CIR}$ is an explicit circle entity, the constraint is reduced to `OnCircle`. Otherwise, let the in-scope point list be $(P_{1},\dots,P_{m})$ (non-point tokens are ignored). When $m \geq 4$, the engine fixes the first three points $(A,B,C) = (P_{1},P_{2},P_{3})$ as a reference and for each remaining point $D \in \{P_{4},\dots,P_{m}\}$ enforces the determinant form:

$$\det\begin{pmatrix} x_{A}^{2}+y_{A}^{2} & x_{A} & y_{A} & 1 \\ x_{B}^{2}+y_{B}^{2} & x_{B} & y_{B} & 1 \\ x_{C}^{2}+y_{C}^{2} & x_{C} & y_{C} & 1 \\ x_{D}^{2}+y_{D}^{2} & x_{D} & y_{D} & 1 \end{pmatrix} = 0, \qquad \forall\, D \in \{P_{4},\dots,P_{m}\}.$$

(Equivalently, the analytical engine appends $\det(M_{A,B,C,D})$ as a polynomial equation for each extra point.)
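
The determinant test can be checked by hand on integer coordinates. The sketch below builds the $4 \times 4$ matrix row by row and evaluates it by Laplace expansion; the helper names are ours, and the solver itself would keep the entries symbolic.

```python
def det(m):
    """Determinant by Laplace expansion along the first row (fine for 4x4)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, entry in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * entry * det(minor)
    return total

def concyclic_residual(a, b, c, d):
    """Value of det(M_{A,B,C,D}); zero when the four points share a circle."""
    rows = [[x * x + y * y, x, y, 1] for (x, y) in (a, b, c, d)]
    return det(rows)

# Three reference points on the unit circle.
A, B, C = (1, 0), (0, 1), (-1, 0)
print(concyclic_residual(A, B, C, (0, -1)))      # 0: fourth point is on the circle
print(concyclic_residual(A, B, C, (2, 0)) != 0)  # True: (2, 0) is off the circle
```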

#### Gauge constraints.

If fewer than two anchored points exist, the engine fixes an origin and scale: either $(x_{P_{0}}, y_{P_{0}}) = (0,0)$ and $(x_{P_{1}}, y_{P_{1}}) = (1,0)$, or, if one anchor exists, it pins a second point to be one unit to the right on the same horizontal line.

#### Closed-form pre-locking (before global solving).

Before materializing the full symbolic system, our analytical solver opportunistically locks a small set of *single-unknown* patterns in closed form to reduce search space and avoid degenerate roots. A representative case is a *translation-locked* segment: if a reference segment $\overline{AB}$ has both endpoints anchored and a target segment has one anchored endpoint $C$ with the other endpoint $D$ unknown, the solver may directly set

$$(x_{D}, y_{D}) = (x_{C}, y_{C}) + \bigl((x_{B}-x_{A}), (y_{B}-y_{A})\bigr),$$

provided that $D$ is not simultaneously entangled by multiple unrelated constraints. To keep the verification uniform, such locked results are injected as equality equations (rather than directly writing coordinates) whenever the point is not yet hard-locked.

#### Equation sanitization\.

All constraints are converted into a zero-equality form $f(\mathbf{z})=0$. If an equation becomes constant (no free symbols), the solver treats it as an immediate consistency check: it is discarded when numerically satisfied, and the instance is rejected when violated.

#### Hybrid solve with timeout and numerical fallback\.

Our analytical solver first attempts a symbolic solve under a strict time budget. If the symbolic attempt fails, times out, or yields a degenerate assignment that cannot pass residual checks, it falls back to a numerical root finder (e.g., `nsolve`) with randomized initializations and accepts the first solution that satisfies all constraints within tolerance.

#### Underdetermined completion \(free\-parameter handling\)\.

When the system is underdetermined (fewer independent equations than unknowns), our analytical solver introduces auxiliary parameters (e.g., for `PointOnSegment` or unknown $r^{2}$ in circles) and assigns a deterministic fill constant to a small subset of low-frequency free variables to obtain a concrete embedding. Additionally, for a perpendicular constraint where the shared endpoint is still free, a geometric completion may be applied by placing the point at a Thales construction over the two known endpoints, i.e.,

$$U=M\pm\frac{\|A-B\|}{2}\,\mathbf{n},\quad M=\frac{A+B}{2},$$

with a fixed sign convention to avoid point collapse.
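The Thales completion can be sketched as follows. This is an illustrative sketch under one stated assumption: the text does not define $\mathbf{n}$, so the code takes it to be the unit normal to $AB$ (any unit vector would keep $U$ on the circle with diameter $AB$). The function name `thales_point` and the `sign` parameter are hypothetical.

```python
import math


def thales_point(a, b, sign=1):
    """Place U on the circle with diameter AB so that UA is perpendicular to UB.

    Assumption: n is the unit normal to AB; `sign` (+1 or -1) plays the role
    of the fixed sign convention mentioned in the text.
    """
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0:
        raise ValueError("A and B coincide; the Thales circle is degenerate")
    mx, my = (ax + bx) / 2, (ay + by) / 2   # midpoint M
    nx, ny = -dy / length, dx / length      # unit normal n (assumed choice)
    r = length / 2                          # radius ||A - B|| / 2
    return (mx + sign * r * nx, my + sign * r * ny)
```

With `a = (0, 0)` and `b = (2, 0)`, the sketch returns `(1.0, 1.0)`, and the vectors from that point to `a` and `b` have zero dot product, as the perpendicular constraint requires.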

#### Residual check and safe commit\.

A candidate solution is accepted only if every equation evaluates to a scalar residual below a global threshold\. Upon acceptance, the analytical solver commits each newly solved point*once*and respects protected points unless explicitly overridden; committed points are added to a hard\-lock list for downstream stages\.

#### Circle synchronization and second\-intersection repair\.

After committing coordinates, circle radii are synchronized from the current center/through-point definition. To address the common ambiguity of circle–line intersections (two valid roots), our analytical solver includes a post-pass that detects points stuck at the anchor/root under `OnCircle`+`Collinear`-style constraints and moves them to the alternative intersection whenever required for consistency.

### E\.2Rank\-aware solving and underdetermined completion

#### Component splitting\.

Given an equation list $\mathcal{E}$ over symbols $\mathcal{S}$, the analytical solver builds a bipartite graph between equations and the symbols appearing in them, and solves each connected component independently. This reduces blow-up and allows local fallbacks.
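One way to realize this component split is a union-find pass over the equation–symbol graph. The sketch below is an assumption about how such a split could be written, not the solver's actual code; `split_components` and its input format (one set of symbol names per equation) are hypothetical.

```python
def split_components(equation_symbols):
    """Group equation indices into connected components of the equation-symbol graph.

    equation_symbols: list where entry i is the set of symbol names in equation i.
    Returns a sorted list of lists of equation indices.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    # Link every equation node to each symbol node it mentions.
    for i, syms in enumerate(equation_symbols):
        find(("eq", i))
        for s in syms:
            union(("eq", i), ("sym", s))

    groups = {}
    for i in range(len(equation_symbols)):
        groups.setdefault(find(("eq", i)), []).append(i)
    return sorted(groups.values())
```

For example, equations over `{xA, yA}`, `{yA, xB}`, `{xC}`, and `{xC, yC}` split into two components, `[0, 1]` and `[2, 3]`, which can then be solved (or fall back) independently.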

#### Rank estimation and independent equation selection\.

For a component, the engine estimates rank by numerically evaluating the Jacobian $J$ at random assignments and computing $\mathrm{rank}(J)$. It then greedily selects a subset of equations that increases rank to (approximately) the target rank, improving robustness under redundancy.

#### Solvers and acceptance\.

The engine first calls symbolic `solve`. If it fails, it falls back to `nsolve` with randomized initializations and accepts a solution iff the residual of *all* compiled equations is below `analytic_tol`. If a component remains underdetermined, remaining free symbols are filled deterministically with `analytic_fill_constant` (default $0.35$), optionally attempting partial solves first (`analytic_fill_free_params=True`).

### E\.3Closed\-form warm starts and repairs

Our analytical solver applies closed\-form placements before global solving:

- •Midpoint: $M=\frac{A+B}{2}$ when both endpoints are numeric.
- •Ratio-locked on segment: $P=A+r(B-A)$ when $r$ is provided.
- •Triangle canonical points (if $A,B,C$ are known):
  - –Incenter $I$ via side-length barycentric weights: $I=\frac{aA+bB+cC}{a+b+c}$ with $a=\|B-C\|$, $b=\|A-C\|$, $c=\|A-B\|$.
  - –A specific bisector–circumcircle second intersection $E$ computed by intersecting the $A$-angle-bisector ray with the circumcircle and choosing the non-$A$ intersection.
  - –Thales-style points $N,P$ on the circle with diameter $EB$ and $EC$, respectively, with a fixed sign rule to avoid coincident flips.
- •Circle–line second intersection repair: if a point is constrained by `Concyclic` and `Collinear` and collapses to the anchor intersection, choose the farther non-anchor intersection on the circle.
- •Parallel single-unknown completion: when a parallel constraint involves a segment with one unknown endpoint that participates in no other constraints, translate the known endpoint by the reference direction vector.
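Three of the closed-form placements listed above (midpoint, ratio-locked point, and incenter) can be sketched directly from their formulas. The function names are hypothetical and the sketch assumes all input points are already numeric.

```python
import math


def midpoint(a, b):
    """M = (A + B) / 2."""
    return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)


def ratio_point(a, b, r):
    """P = A + r (B - A) for a provided ratio r."""
    return (a[0] + r * (b[0] - a[0]), a[1] + r * (b[1] - a[1]))


def incenter(a, b, c):
    """I = (aA + bB + cC) / (a + b + c), weighting each vertex by the opposite side length."""
    la = math.dist(b, c)  # a = ||B - C||
    lb = math.dist(a, c)  # b = ||A - C||
    lc = math.dist(a, b)  # c = ||A - B||
    s = la + lb + lc
    return ((la * a[0] + lb * b[0] + lc * c[0]) / s,
            (la * a[1] + lb * b[1] + lc * c[1]) / s)
```

For the 3-4-5 right triangle with `A = (0, 0)`, `B = (4, 0)`, `C = (0, 3)`, the incenter evaluates to `(1.0, 1.0)`, consistent with an inradius of 1.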

### E\.4Non\-strict mode: incidence completion and semantic intersection passes

#### Intersection resolution\.

Given a point $P$ constrained by intersections of (1) line–line, (2) circle–line, or (3) circle–circle, our analytical solver enumerates candidate intersection points and chooses: (i) a candidate within segment bounds if applicable, (ii) a candidate not already used by another point (via `used_targets`), and (iii) the candidate closest to $P$'s previous coordinates to stabilize layouts. A minimum separation threshold avoids multiple points collapsing to the same location.

#### Semantic passes \(implemented as four passes\)\.

After raw intersections, the analytical solver performs semantic placements for common constructions:

- •Pass 1 (local endpoints / midpoints): ensure auxiliary points tied to a segment are consistent with its current endpoints and midpoint definitions.
- •Pass 2 (triangle foot / bisector intersection): compute foot point $H$ and bisector intersection $L$ from available $(A,B,C)$-like configurations using projection / line intersection.
- •Pass 3 (circumcircle and tangent proxies): compute a designated second intersection $P$ of a bisector with circumcircle; compute $S$ as the intersection of a tangent-at-$A$ direction with $BC$; and set point $R$ on the tangent line through $A$.
- •Pass 4 (direction correction): refine cached directions for segments in `Parallel`/`Perpendicular` relations to reduce accumulated drift.

All semantic writes respect `hard_locked` points and do not overwrite protected points unless explicitly allowed.

## Appendix FNumerical verification library

Numerical verification provides *deterministic, per-constraint* residual checks over the embedded diagram produced by our analytical solver (and its repair loops). Concretely, `verify_constraints(ang_tol_deg, len_tol, circle_tol)` returns a dictionary keyed by constraint ID, where each entry contains (i) a three-valued verdict $\texttt{ok}\in\{\texttt{True},\texttt{False},\texttt{None}\}$, (ii) a diagnostic residual string `msg`, and (iii) the constraint `type`, `scope`, and `source`. We intentionally use `ok=None` for unsupported or unchecked constraint types to avoid falsely asserting correctness when a numerical predicate is not implemented.

#### Shared geometric primitives\.

Let $P=(x_{P},y_{P})$ denote a point coordinate, $\|u\|_{2}$ the Euclidean norm, and $\overrightarrow{PQ}=(x_{Q}-x_{P},\,y_{Q}-y_{P})$. We use: (i) distances $d(P,Q)=\|P-Q\|_{2}$; (ii) angles $\angle(u,v)=\arccos\!\big(\frac{u\cdot v}{\|u\|_{2}\|v\|_{2}}\big)$ in degrees, with degeneracy checks when $\|u\|_{2}$ or $\|v\|_{2}$ is near zero; (iii) segment endpoints inferred from `Incidence` constraints (and token fallbacks for `SEG_AB`); (iv) point–line distance $\mathrm{dist}(P,\overline{AB})=\frac{|(P-A)\times(B-A)|}{\|B-A\|_{2}}$ and its projection parameter $t=\frac{(P-A)\cdot(B-A)}{\|B-A\|_{2}^{2}}$ (used for segment-bounded checks).
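The shared primitives can be written compactly in plain Python. The sketch below follows the formulas in this paragraph; the function names and the degeneracy threshold `eps` are assumptions, since the text only says that near-zero norms are checked.

```python
import math


def sub(p, q):
    """Vector from q to p."""
    return (p[0] - q[0], p[1] - q[1])


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def cross(u, v):
    """2D scalar cross product."""
    return u[0] * v[1] - u[1] * v[0]


def angle_deg(u, v, eps=1e-12):
    """Angle between u and v in degrees, or None when either vector is near zero."""
    nu, nv = math.hypot(*u), math.hypot(*v)
    if nu < eps or nv < eps:
        return None
    c = max(-1.0, min(1.0, dot(u, v) / (nu * nv)))  # clamp against rounding
    return math.degrees(math.acos(c))


def point_line_distance(p, a, b):
    """dist(P, AB) = |(P - A) x (B - A)| / ||B - A||."""
    ab = sub(b, a)
    return abs(cross(sub(p, a), ab)) / math.hypot(*ab)


def projection_parameter(p, a, b):
    """t = (P - A) . (B - A) / ||B - A||^2; t in [0, 1] means the foot lies on segment AB."""
    ab = sub(b, a)
    return dot(sub(p, a), ab) / dot(ab, ab)
```

For `P = (1, 2)`, `A = (0, 0)`, `B = (4, 0)`, the distance is `2.0` and the projection parameter is `0.25`.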

#### Implemented numerical checks\.

All checks are scale-aware via tolerances `ang_tol_deg`, `len_tol`, and `circle_tol`, and each check reports a residual magnitude to support targeted repair and auditing.

#### Circle incidence\.

For `OnCircle`$(P,\mathrm{CIR}_{O})$ (and the 2-argument `Concyclic` encoding used in our engine), we verify

$$\big|d(P,O)-r\big|\leq\texttt{circle\_tol},$$

where $(O,r)$ are retrieved from the circle entity. For legacy multi-point concyclicity scopes (3/4 points), we conservatively mark the constraint as skipped (compatibility-safe) rather than raising an exception.

#### Metric, angular, and midpoint constraints\.

- •`EqualLength`$(\overline{AB},\overline{CD})$: verify $|d(A,B)-d(C,D)|\leq\texttt{len\_tol}$.
- •`Parallel`$(\overline{AB},\overline{CD})$: let $\theta=\angle(\overrightarrow{AB},\overrightarrow{CD})$ and check $\min(\theta,|180^{\circ}-\theta|)\leq\texttt{ang\_tol\_deg}$.
- •`Perpendicular`$(\overline{AB},\overline{CD})$: check $|90^{\circ}-\angle(\overrightarrow{AB},\overrightarrow{CD})|\leq\texttt{ang\_tol\_deg}$.
- •`EqualAngle`$(\angle XYZ,\angle DEF)$: parse each angle token into vertex triplets and verify $|\angle XYZ-\angle DEF|\leq\texttt{ang\_tol\_deg}$.
- •`Midpoint`$(M,\overline{UV})$: verify $|d(M,U)-d(M,V)|\leq\texttt{len\_tol}$.
- •`AngleBisector`$(\overline{YE},\angle XYZ)$: using the named angle entity (e.g., "$\angle XYZ$"), verify $|\angle XYE-\angle EYZ|\leq\texttt{ang\_tol\_deg}$.
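The angular checks in this list reduce to one angle helper plus a tolerance comparison. The following sketch covers `Parallel` and `Perpendicular` only; the function names and the default `ang_tol_deg=1.0` are assumptions, as the text does not state default tolerance values. Returning `None` on degenerate input mirrors the three-valued verdict described earlier.

```python
import math


def _angle_deg(u, v, eps=1e-12):
    """Angle between two vectors in degrees; None for near-zero vectors."""
    nu, nv = math.hypot(*u), math.hypot(*v)
    if nu < eps or nv < eps:
        return None
    c = (u[0] * v[0] + u[1] * v[1]) / (nu * nv)
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))


def _vec(p, q):
    return (q[0] - p[0], q[1] - p[1])


def parallel_ok(a, b, c, d, ang_tol_deg=1.0):
    """min(theta, |180 - theta|) <= tol, so anti-parallel directions also pass."""
    theta = _angle_deg(_vec(a, b), _vec(c, d))
    if theta is None:
        return None
    return min(theta, abs(180.0 - theta)) <= ang_tol_deg


def perpendicular_ok(a, b, c, d, ang_tol_deg=1.0):
    """|90 - theta| <= tol."""
    theta = _angle_deg(_vec(a, b), _vec(c, d))
    if theta is None:
        return None
    return abs(90.0 - theta) <= ang_tol_deg
```

Taking the minimum of $\theta$ and $|180^{\circ}-\theta|$ makes the parallel check independent of the order in which each segment's endpoints are listed.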

#### Collinearity and incidence\.

- •`Collinear`$(A,B,C)$: compute the absolute value of twice the signed area, $A_{2}=\big|(B-A)\times(C-A)\big|$, and check $A_{2}\leq\texttt{len\_tol}\cdot\max\!\big(d(A,B),d(B,C),d(C,A),1\big)$, which is more scale-stable than a fixed absolute threshold.
- •`PointOnLine`$(P,\overleftrightarrow{AB})$: check $\mathrm{dist}(P,\overleftrightarrow{AB})\leq\texttt{len\_tol}$.
- •`PointOnSegment`$(P,\overline{AB})$: check both $\mathrm{dist}(P,\overleftrightarrow{AB})\leq\texttt{len\_tol}$ and $t\in[-10^{-3},\,1+10^{-3}]$ for the projection parameter $t$ to enforce boundedness with a small numerical margin.
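The collinearity and segment-incidence checks can be sketched as below. The function names and the default `len_tol=1e-6` are assumptions made for illustration; the $10^{-3}$ margin on $t$ is taken from the text.

```python
import math


def collinear_ok(a, b, c, len_tol=1e-6):
    """|(B - A) x (C - A)| <= len_tol * max(side lengths, 1)."""
    area2 = abs((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]))
    scale = max(math.dist(a, b), math.dist(b, c), math.dist(c, a), 1.0)
    return area2 <= len_tol * scale


def point_on_segment_ok(p, a, b, len_tol=1e-6, margin=1e-3):
    """Point-line distance within len_tol and projection parameter t in [-margin, 1 + margin]."""
    abx, aby = b[0] - a[0], b[1] - a[1]
    apx, apy = p[0] - a[0], p[1] - a[1]
    norm2 = abx * abx + aby * aby
    if norm2 == 0:
        return None  # degenerate segment: leave unchecked rather than assert
    dist = abs(apx * aby - apy * abx) / math.sqrt(norm2)
    t = (apx * abx + apy * aby) / norm2
    return dist <= len_tol and -margin <= t <= 1 + margin
```

A point slightly past an endpoint, such as `(4.002, 0)` on the segment from `(0, 0)` to `(4, 0)`, still passes because $t=1.0005$ lies inside the margin, whereas `(5, 0)` fails.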

#### Vector relations\.

Let $\overrightarrow{PQ}$ be defined from point coordinates. We use $\|\Delta v\|_{2}$ as the residual for all vector equalities.

- •`VectorSum`: $\overrightarrow{UV}=\overrightarrow{PQ}+\overrightarrow{RS}$, check $\|\overrightarrow{UV}-(\overrightarrow{PQ}+\overrightarrow{RS})\|_{2}\leq\texttt{len\_tol}$.
- •`VectorDiff`: $\overrightarrow{UV}=\overrightarrow{PQ}-\overrightarrow{RS}$, check $\|\overrightarrow{UV}-(\overrightarrow{PQ}-\overrightarrow{RS})\|_{2}\leq\texttt{len\_tol}$.
- •`VectorDotZero`: $\overrightarrow{PQ}\cdot\overrightarrow{RS}=0$, check $|\overrightarrow{PQ}\cdot\overrightarrow{RS}|\leq\texttt{len\_tol}$.
- •`VectorLinear`: $\overrightarrow{UV}=\sum_{i}k_{i}\,\overrightarrow{P_{i}Q_{i}}$ (coefficients parsed as floats), check $\big\|\overrightarrow{UV}-\sum_{i}k_{i}\overrightarrow{P_{i}Q_{i}}\big\|_{2}\leq\texttt{len\_tol}$.

#### Triangle relations\.

For $\triangle ABC$ and $\triangle DEF$, let side-length vectors be $\ell_{1}=(|AB|,|BC|,|CA|)$ and $\ell_{2}=(|DE|,|EF|,|FD|)$, and angle vectors be $\alpha_{1}=(\angle A,\angle B,\angle C)$ and $\alpha_{2}=(\angle D,\angle E,\angle F)$, computed from coordinates. Degenerate cases (near-zero sides/undefined angles) are rejected.

- •`SimilarTriangles`: compute ratios $r_{i}=\ell_{1,i}/\ell_{2,i}$, set $\bar{r}=\frac{1}{3}\sum_{i}r_{i}$, and check $\max_{i}|r_{i}-\bar{r}|/\max(|\bar{r}|,10^{-6})\leq 5\times 10^{-3}$ and $\max_{i}|\alpha_{1,i}-\alpha_{2,i}|\leq\texttt{ang\_tol\_deg}$.
- •`CongruentTriangles`: check $\max_{i}|\ell_{1,i}-\ell_{2,i}|\leq\texttt{len\_tol}$.
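The `SimilarTriangles` check combines a relative spread on side ratios with an absolute bound on corresponding angles. A minimal sketch, assuming vertices are passed in corresponding order and using a hypothetical default `ang_tol_deg=1.0` and degeneracy threshold `eps`:

```python
import math


def _sides(a, b, c):
    """(|AB|, |BC|, |CA|)."""
    return (math.dist(a, b), math.dist(b, c), math.dist(c, a))


def _angle_at(p, q, r):
    """Interior angle at vertex p, in degrees."""
    ux, uy = q[0] - p[0], q[1] - p[1]
    vx, vy = r[0] - p[0], r[1] - p[1]
    c = (ux * vx + uy * vy) / (math.hypot(ux, uy) * math.hypot(vx, vy))
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))


def similar_ok(t1, t2, ang_tol_deg=1.0, eps=1e-12):
    """Side-ratio spread <= 5e-3 (relative) and max angle difference <= ang_tol_deg."""
    l1, l2 = _sides(*t1), _sides(*t2)
    if min(l1 + l2) < eps:
        return False  # degenerate triangles are rejected
    ratios = [x / y for x, y in zip(l1, l2)]
    mean = sum(ratios) / 3
    spread = max(abs(r - mean) for r in ratios) / max(abs(mean), 1e-6)
    (a, b, c), (d, e, f) = t1, t2
    ang1 = (_angle_at(a, b, c), _angle_at(b, c, a), _angle_at(c, a, b))
    ang2 = (_angle_at(d, e, f), _angle_at(e, f, d), _angle_at(f, d, e))
    ang_diff = max(abs(x - y) for x, y in zip(ang1, ang2))
    return spread <= 5e-3 and ang_diff <= ang_tol_deg
```

A 3-4-5 triangle and its copy scaled by two pass the check; pairing it with an isosceles right triangle fails on the side-ratio spread.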

#### Circle theorems\.

- •`Tangent`$(\overline{t},\mathrm{CIR}_{O},T)$: verify (i) $T$ lies on the circle within `circle_tol`, and (ii) the tangent direction at $T$ is perpendicular to the radius $\overrightarrow{OT}$: $|90^{\circ}-\angle(\overrightarrow{OT},\overrightarrow{TT^{\prime}})|\leq\texttt{ang\_tol\_deg}$, where $T^{\prime}$ is the farthest other incidence point on segment $\overline{t}$.
- •`PowerEquality`$(P,\mathrm{CIR}_{O},\overline{s},\overline{t})$: let $U,V$ be the two circle points incident to the secant segment $\overline{s}$ (chosen closest to $P$), and let $T$ be a circle point incident to the tangent segment $\overline{t}$ that also contains $P$. We verify $$\big|\,|PU|\cdot|PV|-|PT|^{2}\,\big|\leq\tau,\qquad\tau=\max\big(\texttt{len\_tol}\cdot\max(|PU|\cdot|PV|,|PT|^{2},1),\ \texttt{len\_tol}\big),$$ i.e., an error budget that scales with the magnitude of the quantities being compared.
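The scaled error budget in `PowerEquality` can be illustrated with a short sketch that takes the three lengths as inputs. The function name and the default `len_tol=1e-6` are assumptions; selecting $U$, $V$, and $T$ from incidence data is outside the scope of this example.

```python
def power_equality_ok(pu, pv, pt, len_tol=1e-6):
    """Check | |PU|*|PV| - |PT|^2 | <= tau with the scale-aware tau from the text."""
    secant = pu * pv
    tangent = pt * pt
    tau = max(len_tol * max(secant, tangent, 1.0), len_tol)
    return abs(secant - tangent) <= tau
```

For a circle of radius 3 centered at the origin and $P=(5,0)$, the secant along the $x$-axis gives $|PU|=2$ and $|PV|=8$, and the tangent length is $|PT|=\sqrt{25-9}=4$, so both products equal 16 and the check passes.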

#### Expression constraints \(ExprConstraint\)\.

Beyond fixed geometric predicates, we support a unified numerical checker for declarative expressions of scalars, 2D vectors, and complex-valued vectors. Given `ExprConstraint` metadata $(\texttt{lhs},\texttt{relation},\texttt{rhs},\texttt{tolerance})$, we evaluate both sides using an expression evaluator (e.g., `Length()`, `Angle()`, `Area()`, vector/complex utilities) and then check: (i) vector equality/inequality using $\|\Delta v\|_{2}$ with a *scale-adaptive* default tolerance $\max(10^{-6},10^{-6}\cdot\max(\|v_{\texttt{lhs}}\|_{2},\|v_{\texttt{rhs}}\|_{2},1))$; (ii) complex equality/inequality using $|z_{\texttt{lhs}}-z_{\texttt{rhs}}|$ with the same scale rule; (iii) scalar comparisons over $\{=,\neq,<,\leq,>,\geq\}$ with an additive tolerance margin. To improve robustness in mixed textual/diagram inputs, angle tokens such as "$\angle XYZ$" are normalized into a canonical form before evaluation, and arithmetic relations written as operator forms (e.g., "lhs + something = rhs") are rewritten into standard comparisons prior to checking. Each entry records the evaluated values and the effective tolerance, enabling reproducible debugging.
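The scale-adaptive default tolerance for vector equality can be sketched as follows. The function names are hypothetical; only the tolerance formula is taken from the text, and an explicitly supplied `tolerance` is assumed to override the default.

```python
import math


def default_vector_tol(lhs, rhs, base=1e-6):
    """max(base, base * max(||lhs||, ||rhs||, 1))."""
    scale = max(math.hypot(*lhs), math.hypot(*rhs), 1.0)
    return max(base, base * scale)


def vectors_equal(lhs, rhs, tolerance=None):
    """Compare 2D vectors by the Euclidean norm of their difference."""
    tol = default_vector_tol(lhs, rhs) if tolerance is None else tolerance
    residual = math.hypot(lhs[0] - rhs[0], lhs[1] - rhs[1])
    return residual <= tol
```

The same absolute error of $10^{-3}$ is accepted for vectors of magnitude 5000 (tolerance $5\times10^{-3}$) but rejected for vectors of magnitude 5 (tolerance $5\times10^{-6}$), which is the intended effect of scaling the budget.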

#### Hybrid symbolic rescue for floating\-point stability\.

For a subset of geometry primitives {`EqualLength`, `Parallel`, `Perpendicular`, `Collinear`, `PointOnLine`, `PointOnSegment`}, we additionally implement a *symbolic rescue* path: when the numerical check returns `False`, we convert point coordinates into rational numbers via `Fraction(str(x)).limit_denominator(1024)` (with caching) and re-evaluate the predicate exactly using dot/cross products and rational arithmetic. This hybrid design materially reduces false negatives from floating-point noise while preserving the speed and diagnostic richness of the primary numerical verifier.
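The rescue path can be illustrated for the `Collinear` predicate. The conversion expression is the one quoted in the text; the function names and the use of `functools.lru_cache` for caching are assumptions.

```python
from fractions import Fraction
from functools import lru_cache


@lru_cache(maxsize=None)
def to_rational(x, max_den=1024):
    """Snap a float to the nearest rational with denominator <= max_den."""
    return Fraction(str(x)).limit_denominator(max_den)


def collinear_exact(a, b, c):
    """Exact collinearity test: the rational cross product must be exactly zero."""
    (ax, ay), (bx, by), (cx, cy) = [
        tuple(to_rational(v) for v in p) for p in (a, b, c)
    ]
    return (bx - ax) * (cy - ay) - (by - ay) * (cx - ax) == 0
```

For instance, `0.1 + 0.2` evaluates to `0.30000000000000004` in floating point but snaps to `Fraction(3, 10)`, so coordinates that differ only by rounding noise compare as exactly equal. One consequence of this design worth keeping in mind is that any coordinate within rounding distance of a small-denominator rational is treated as that rational, which is why the text applies the rescue only after the numerical check has failed.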

## Appendix GCalibration Exemplars for the Few\-Shot Difficulty Judge

The few-shot difficulty judge described in Section [2.2](https://arxiv.org/html/2606.14176#S2.SS2) is parameterised by nine exemplar problems, three per tier (Easy, Medium, Hard), which jointly define the classifier's notion of difficulty. To verify that the judge is not overly sensitive to any particular choice of exemplars, we manually constructed three calibration sets and report their target-tier matching rates in Table [3](https://arxiv.org/html/2606.14176#S3.T3). Set 1 uses the same exemplars as those used for question generation, while Sets 2 and 3 are independently constructed exemplar sets used only for robustness evaluation. The three sets are pairwise disjoint (27 distinct problems in total); all items were written by the authors and cross-checked by a second reviewer before inclusion.

#### Selection criteria\.

For each set, we selected items that collectively cover the three problem families studied in VeriGeo (Euclidean, Function & Calculus, and Vector & Coordinate), and required every item to satisfy the reasoning-depth convention of its tier: Easy problems resolve with a single named theorem or a direct substitution; Medium problems chain two to three theorems with at most one derived point; Hard problems chain four or more reasoning steps, typically over multiple derived points or a non-trivial geometric invariant. Two authors independently scored each candidate on these criteria, and disagreements were reconciled through discussion before the item was admitted into a set.

Set 1 consists of the exemplars we used for question generation. Sets 2 and 3 are two additional sets used to verify robustness, that is, to check whether the judged difficulty of our verified questions remains stable under a different choice of exemplars.

### G\.1Set 1

Set 1 serves a dual role: it provides the difficulty exemplars used to guide problem generation and also calibrates the few\-shot difficulty judge for evaluation\.

#### Easy / Euclidean\.

Problem. In triangle $ABC$, $D,E$ lie on $AB,AC$ with $AD=3$, $DB=6$, $AE=4$, $EC=8$; $\angle ABC=50^{\circ}$, $\angle ACB=70^{\circ}$. Find $\angle DEC$. Why selected. The proportion $AD/AB=AE/AC$ gives $DE\parallel BC$ in one step; the answer follows from a single parallel-line angle relation, a canonical one-step Easy.

#### Easy / Function & Calculus\.

Problem. $f(x)=x^{2}-4$. Points $A,B$ lie on the curve at $x=1,3$; vertical drops from $A,B$ hit the $x$-axis at $C,D$. Segment $AB$ connects $A,B$. Why selected. Every length, slope, and area reduces to a direct substitution into $f$ with no auxiliary construction.

#### Easy / Vector & Coordinate\.

Problem. $A(0,0)$, $B(6,0)$, $C(2,4)$; $\vec{AD}=\vec{AC}-\vec{AB}$; $M$ is the midpoint of $AC$. Find $|DM|$. Why selected. Coordinates of $D$ and $M$ follow from direct vector arithmetic; $|DM|$ is one distance-formula evaluation.

#### Medium / Euclidean\.

Problem. $AB=AC=10$, $BC=12$. $O$ is the circumcenter; $P$ on the bisector of $\angle BAC$ has perpendicular distance $2$ to $AB$. Find the perpendicular distance from $P$ to $BC$. Why selected. Three composed ideas (isosceles symmetry, the half-angle sine relation, and a coordinate-distance check) yield the answer, exemplifying a three-theorem Medium chain.

#### Medium / Function & Calculus\.

Problem. $f(x)=ax^{2}$ ($a>0$). $A,B$ on graph at $x=0,2$, with $|AB|=2\sqrt{10}$. Compute $\int_{0}^{2}(f^{\prime}(x))^{2}\,dx$. Why selected. The solver first recovers $a$ from the chord length, then evaluates $\int_{0}^{2}(2ax)^{2}\,dx$; two named tools chained through one unknown.

#### Medium / Vector & Coordinate\.

Problem. $AB=5$, $BC=6$, $AC=7$; $M$ mid of $AC$; $\vec{AP}=\vec{AB}+\vec{AC}$, $\vec{AQ}=\vec{AB}-\vec{AC}$. Compute $|MP|^{2}+|MQ|^{2}$. Why selected. Dot-product expansion plus substitution of $\vec{AB}\cdot\vec{AC}=(AB^{2}+AC^{2}-BC^{2})/2$ from the law of cosines: two named theorems with two derived points.

#### Hard / Euclidean\.

Problem. $AB=AC$. $M$ mid of $BC$. The bisector of $\angle ABC$ meets $AC$ at $D$ and $AM$ at $P$. Given $AP=3PM$ and $AM=12$, find $|BP|$. Why selected. Four interleaved insights are required: isosceles symmetry, the angle-bisector theorem, a similarity argument for the $3{:}1$ ratio, and Pythagoras in $\triangle BMP$.

#### Hard / Function & Calculus\.

Problem. $f$ on $[0,6]$ consists of segment $OP$ from the origin and a circular arc from $P$ to $Q$ on the $x$-axis, with $x_{P}=3$, $x_{Q}=6$. The arc lies on a circle tangent to the $x$-axis at $Q$, and $OP$ is tangent to the arc at $P$. Compute the average rate of change of $f$ on $[0,3]$. Why selected. Two simultaneous tangency conditions yield a non-linear system whose solution gives the radius and the height of $P$; the implicit tangent–radius perpendicular is a Hard-level structural marker.

#### Hard / Vector & Coordinate\.

Problem. Parallelogram $ABCD$, $M$ mid of $BC$, $AM\perp BC$, $|BD|=2|AB|$, $|AB|=4$. Compute the area. Why selected. Two independent structural constraints are combined with the parallelogram identity $|BD|^{2}+|AC|^{2}=2(|AB|^{2}+|BC|^{2})$ and a vector projection: three derived points and a four-equation system.

### G\.2Set 2

#### Easy / Euclidean\.

Problem. Triangle $ABC$ with $AB=6$, $BC=8$, $AC=10$. $D,E,F$ are midpoints of $BC,AC,AB$. $K$ on $EF$ with $EK=1$. Find $|DK|$. Why selected. The midsegment theorem reduces the configuration to a coordinate computation; one named theorem to reach the answer.

#### Easy / Function & Calculus\.

Problem. $f(x)=x^{2}$ and $g(x)=2x+3$. Let $N$ be the number of intersection points on $[-2,4]$; let $T$ be the $y$-intercept of the tangent to $f$ at $x=1$. Compute $N+T$. Why selected. Two lightweight sub-tasks (intersection count and tangent $y$-intercept) summed; no theorem chaining, just composition of two elementary checks.

#### Easy / Vector & Coordinate\.

Problem. Triangle $ABC$, $AB=5$, $AC=7$. $D$ is constructed so that $ABDC$ is a parallelogram. $M$ mid of $BC$, $|AM|=4$. Find $|BC|$. Why selected. The parallelogram diagonal relation $|AD|^{2}+|BC|^{2}=2(|AB|^{2}+|AC|^{2})$ with $|AD|=2|AM|$ gives $|BC|$ in one identity.

#### Medium / Euclidean\.

Problem. Trapezoid $ABCD$ with $AB\parallel CD$, $AB=8$, $CD=12$. The diagonals meet at $P$; a line through $P$ parallel to $AB$ meets $AD$ at $M$ and $BC$ at $N$. Find $|MN|$. Why selected. The solver must recognise that $|MN|$ is the harmonic mean of the parallel sides: two similarity chains plus an algebraic reduction.

#### Medium / Function & Calculus\.

Problem. $f(x)=x^{3}$ on $[0,3]$. Let $c$ satisfy $\int_{0}^{c}f(t)\,dt=4$ and $d$ satisfy $f^{\prime\prime}(d)=6$. Find $c+d$. Why selected. Two independent calculus skills (definite integration and second-derivative root) feed a single algebraic sum.

#### Medium / Vector & Coordinate\.

Problem. $ABEC$ is a parallelogram with $AE\perp BC$, $AB=5$, $BC=6$. Find $|AE|$. Why selected. Perpendicular diagonals force a rhombus-like constraint that combines with the parallelogram diagonal identity; two-theorem chain with one derived point.

#### Hard / Euclidean\.

Problem. Square $ABCD$ with side $20$. $E\in BC$, $F\in CD$ with $BE=DF$; $AE,AF$ meet diagonal $BD$ at $P,Q$. Given $|PQ|=5\sqrt{2}$, find $|EF|$. Why selected. A $45^{\circ}$ rotation pairs $E\leftrightarrow F$; similarity relates $|PQ|$ to $BE$; a Pythagorean step then delivers $|EF|$: four chained ideas including a hidden symmetry.

#### Hard / Function & Calculus\.

Problem. Parabola $y=x^{2}$. $P$ lies on it in the first quadrant. The tangent at $P$ meets the axes at $A,B$; the midpoint $M$ of $AB$ lies on $y=-x$. Find the area of $\triangle OAB$. Why selected. Parametrise $P=(t,t^{2})$, derive the tangent, impose the midpoint-on-line condition, solve for $t$, and compute the area: a four-step symbolic pipeline.

#### Hard / Vector & Coordinate\.

Problem. Equilateral $\triangle ABC$ with side $6$. $B$ is the midpoint of $CP$; $G$ is the centroid of $\triangle ABC$. Find $|PG|$. Why selected. Express $P$ via $\vec{CP}=2\vec{CB}$, place $G=(A+B+C)/3$, and evaluate $|PG|^{2}$ through dot products: three derived points with an equilateral-triangle identity.

### G\.3Set 3

#### Easy / Euclidean\.

Problem. Right triangle with $\angle ACB=90^{\circ}$, $AC=6$, $BC=8$. $AD$ bisects $\angle CAB$, $D\in BC$; $DE\perp AB$, $E\in AB$. Find $|DE|$. Why selected. The angle-bisector theorem on the right triangle combined with the equal-tangent identity: one named theorem with two derived points.

#### Easy / Function & Calculus\.

Problem. $f(x)=x^{2}-6x+8$. $P,Q$ on graph at $x=1,5$; $R$ is the vertex of the parabola. Find the area of $\triangle PQR$. Why selected. Three substitutions plus the triangle area formula: no theorem chaining, but three named points force explicit bookkeeping.

#### Easy / Vector & Coordinate\.

Problem. Parallelogram $ABCD$; $E\in AB$ with $AE=4$, $F\in CD$ with $CF=4$; $AB=12$, $AD=6$. Find $|EF|$. Why selected. Vector decomposition $\vec{EF}=\vec{EB}+\vec{BC}+\vec{CF}$ collapses to one magnitude evaluation.

#### Medium / Euclidean\.

Problem. Triangle $ABC$ has area $120$; $M$ mid of $BC$; $D$ constructed so that $ABMD$ is a parallelogram. $P$ is the intersection of $AC$ and $DM$. Find the area of $\triangle ADP$. Why selected. Parallelogram construction, similarity between $\triangle ADP$ and $\triangle ACM$, and an area-ratio reduction: a three-theorem Medium.

#### Medium / Function & Calculus\.

Problem. $f(x)=x^{2}-2x-3$. Find the area of the region bounded by the curve and the $x$-axis. Why selected. Standard definite-integral area problem after solving a quadratic for the roots: two tools chained through the roots.

#### Medium / Vector & Coordinate\.

Problem. Parallelogram $ABCD$ with $\vec{AB}\cdot\vec{AD}=0$ ($AB=6$, $AD=8$). $P\neq C$ such that $PBCD$ is a kite with $PB=CB$, $PD=CD$. Find $|AP|$. Why selected. Orthogonality turns the parallelogram into a rectangle; the kite constraint reflects $P$ over a diagonal; $|AP|$ follows by coordinate computation: three steps composed.

#### Hard / Euclidean\.

Problem. Isosceles trapezoid $ABCD$ with $AB\parallel CD$, $AB=10$, $CD=4$. The Euler line of $\triangle ABC$ is parallel to $AB$. Find $|AD|$. Why selected. Euler-line parallelism produces a centroid–circumcenter alignment condition that reduces to a non-linear equation in $|AD|$; the Euler-line invocation itself is a Hard-level structural marker.

#### Hard / Function & Calculus\.

Problem. $f(x)=|x^{2}-16|$. $A,B$ on graph at $x=-2,5$; $L$ is the tangent to $f$ at $B$; $C$ is the intersection of $L$ with the $y$-axis; $M$ on segment $AC$ with $AM:MC=2:3$. Find $y_{M}$. Why selected. Absolute-value branch selection, tangent construction on the branch containing $B$, and a weighted midpoint: four chained sub-problems.

#### Hard / Vector & Coordinate\.

Problem. Triangle with $AB=6$, $AC=8$, $\angle BAC=60^{\circ}$; $AD$ bisects $\angle BAC$ with $|AD|=10$; $P$ is the midpoint of $BD$. Find $|CP|$. Why selected. The angle-bisector length formula combines with a Stewart-style ratio on $BD$ and a dot-product projection for $|CP|$: three named theorems interleaved.

### G\.4Discussion

Across the three sets, the target-tier matching rates reported in Table [3](https://arxiv.org/html/2606.14176#S3.T3) remain consistently high within each tier, indicating that the judge's calibration does not hinge on any single choice of exemplars: replacing all nine items with a pairwise-disjoint substitute yields matching rates that vary by only a few percentage points. When adapting VeriGeo to a new curriculum, practitioners may therefore freely substitute the above exemplars with curriculum-specific ones, provided each tier's items satisfy the same reasoning-depth convention described in the selection criteria above.

## Appendix HDetails of the Seed\-Conditioned Difficulty Modulation Evaluation

We provide additional details for the seed\-conditioned difficulty modulation experiment reported in the main text\. This experiment evaluates whether VeriGeo can modulate the difficulty of a generated variant relative to a given MathVista seed while preserving the seed’s underlying geometric theme\.

#### Seed\-conditioned generation setting\.

We use the same 100 MathVista source problems as in the seed-conditioned generation experiment. For each seed problem, we construct two generation settings. In the Harder setting, the generator is instructed to produce a variant that preserves the seed's geometric theme but requires a higher level of reasoning. In the Equivalent setting, the generator is instructed to produce a variant with comparable difficulty while still preserving the original theme. In both settings, the generator is required to maintain the core geometric context of the seed rather than replacing it with an unrelated problem.

#### Knowledge Similarity Evaluation

We evaluate the knowledge similarity in Figure [2](https://arxiv.org/html/2606.14176#S3.F2)(b) using an LLM-as-a-judge prompt. The prompt is shown below:

```
"""
You are a geometry concept similarity evaluator.
Given two lists of geometry concepts from paired problems, score how similar
their concept distributions are. Consider semantic equivalence and overlap.
Return JSON only with:
  - similarity: float between 0 and 1 (1 = identical, 0 = no overlap).
"""
```

#### Pairwise difficulty\-relation judge\.

We provide the prompt used for pairwise difficulty comparison, which asks the judge to determine whether one question is harder than another \(PROMPT\_HARDER\) or whether their difficulties are equivalent \(PROMPT\_EQUIV\)\.

```
PROMPT_HARDER = """
You are an LLM-based pairwise difficulty-relation judge for geometry problems.

You are given a seed problem and a generated variant authored from that seed.
The requested generation condition is NOT provided to you. Your task is to
compare the generated variant against the seed problem only after generation.

[SEED] (image attached)
SEED question: {seed_question}
SEED choices: {seed_choices}

[GENERATED]
GENERATED problem statement:
{gen_problem}

GENERATED proof outline (numbered steps):
{gen_proof}

Decide whether the GENERATED problem is strictly HARDER than the SEED problem.

Base your judgment on:
- changes in the number of reasoning steps;
- the degree of concept composition;
- the need for auxiliary constructions;
- whether the solution introduces additional algebraic or geometric
  dependencies beyond those in the seed.

Answer "harder" only if the generated problem is meaningfully harder than the seed.
Otherwise, answer "not_harder".

Return STRICT JSON only, with no markdown:
{{"answer": "harder" | "not_harder", "reasoning": "<=3 sentences
explaining the judgment"}}
"""
```

```
PROMPT_EQUIV = """
You are an LLM-based pairwise difficulty-relation judge for geometry problems.

You are given a seed problem and a generated variant authored from that seed. The requested
generation condition is NOT provided to you. Your task is to compare the generated
variant against the seed problem only after generation.

[SEED] (image attached)
SEED question: {seed_question}
SEED choices: {seed_choices}

[GENERATED]
GENERATED problem statement:
{gen_problem}

GENERATED proof outline (numbered steps):
{gen_proof}

Decide whether the GENERATED problem has approximately the same difficulty as the SEED problem.

Base your judgment on:
- changes in the number of reasoning steps;
- the degree of concept composition;
- the need for auxiliary constructions;
- whether the solution introduces additional algebraic or geometric
  dependencies beyond those in the seed.

Answer "equivalent" if the generated problem is approximately the same difficulty as the
seed. Answer "not_equivalent" if it is meaningfully harder or meaningfully easier.

Return STRICT JSON only, with no markdown:
{{"answer": "equivalent" | "not_equivalent", "reasoning": "<=3 sentences
explaining the judgment"}}
"""
```
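The doubled braces in the two prompts suggest they are filled with Python's `str.format`, and both require a strict JSON reply. A minimal sketch of that round trip follows. The helper names `fill_prompt` and `parse_judge_reply`, the `mode` argument, and the shortened `DEMO_TEMPLATE` are hypothetical illustrations, not part of the released code, and the call to the judge model itself is omitted.

```python
import json

# Allowed "answer" values for each prompt, taken from the prompt text above.
ALLOWED_ANSWERS = {
    "harder": {"harder", "not_harder"},
    "equivalent": {"equivalent", "not_equivalent"},
}

# Shortened stand-in for PROMPT_HARDER (assumption: same placeholders,
# same doubled braces around the JSON schema).
DEMO_TEMPLATE = (
    "SEED question: {seed_question}\n"
    "SEED choices: {seed_choices}\n"
    "GENERATED problem statement:\n{gen_problem}\n"
    "GENERATED proof outline (numbered steps):\n{gen_proof}\n"
    'Return STRICT JSON only: {{"answer": "harder" | "not_harder"}}'
)


def fill_prompt(template, seed_question, seed_choices, gen_problem, gen_proof):
    """Substitute the four placeholders; '{{' and '}}' become literal braces."""
    return template.format(
        seed_question=seed_question,
        seed_choices=seed_choices,
        gen_problem=gen_problem,
        gen_proof=gen_proof,
    )


def parse_judge_reply(raw, mode):
    """Parse a strict-JSON judge reply and validate its 'answer' field."""
    reply = json.loads(raw)  # raises json.JSONDecodeError on non-JSON output
    if not isinstance(reply, dict):
        raise ValueError("judge reply must be a JSON object")
    answer = reply.get("answer")
    if answer not in ALLOWED_ANSWERS[mode]:
        raise ValueError(f"unexpected answer {answer!r} for mode {mode!r}")
    return {"answer": answer, "reasoning": str(reply.get("reasoning", ""))}
```

Rejecting any reply outside the allowed label set, rather than guessing the intended label, keeps malformed judge outputs from being silently counted as matches or mismatches.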

#### Metric\.

Let $r_i \in \{\textsc{Harder}, \textsc{Equivalent}\}$ denote the requested relation for the $i$\-th seed\-conditioned variant, and let $\hat{r}_i$ denote the relation predicted by the pairwise judge\. We define the target\-difficulty matching rate as

$$\frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\!\left[\hat{r}_i = r_i\right],$$

where $N$ is the number of generated variants in the corresponding setting\. A Harder variant is counted as a match only when the judge determines that it is harder than the seed\. An Equivalent variant is counted as a match only when the judge determines that it has approximately the same difficulty as the seed; variants judged as either easier or harder are treated as mismatches\.
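The matching rate defined above is a plain accuracy over requested and predicted relations. The sketch below shows one way to compute it. The function name `matching_rate`, the label strings, and the toy inputs are illustrative assumptions and are not data from the paper.

```python
from typing import Sequence


def matching_rate(requested: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of variants whose judged relation equals the requested one."""
    if len(requested) != len(predicted):
        raise ValueError("requested and predicted must have the same length")
    if not requested:
        raise ValueError("at least one variant is required")
    # Indicator 1[r_hat_i == r_i], summed over all N variants.
    matches = sum(1 for r, r_hat in zip(requested, predicted) if r == r_hat)
    return matches / len(requested)


# Toy example (made-up labels): four Equivalent requests, one of which the
# judge rejects as "not_equivalent", so it counts as a mismatch.
requested = ["equivalent"] * 4
predicted = ["equivalent", "equivalent", "not_equivalent", "equivalent"]
rate = matching_rate(requested, predicted)  # 3 of 4 match -> 0.75
```

Because each setting is scored separately, the function would be called once on the Harder variants and once on the Equivalent variants.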

#### Discussion\.

As reported in the main text, VeriGeo achieves a 100\.0% target\-difficulty matching rate in the Harder setting and an 80\.0% matching rate in the Equivalent setting\. This gap is expected\. Increasing difficulty provides a clearer control direction, since the generator can introduce additional reasoning steps, auxiliary constructions, or concept composition\. By contrast, preserving an exactly equivalent difficulty level is intrinsically more ambiguous, especially for seed problems near the boundary between two difficulty levels\. These judge\-based results complement the structural analysis in Figure [2](https://arxiv.org/html/2606.14176#S3.F2)\(c\), where the difficulty\-increased variants shift toward higher concept counts while the equivalent\-difficulty variants remain closer to the original seed distribution\.