Guarded Repair for Harm-Aware Post-hoc Replacement of LLM Mathematical Reasoning
Summary
A framework called GuardedRepair is proposed for post-hoc replacement of LLM mathematical reasoning, using selective replacement with safety guards to fix errors while minimizing harm to correct traces. On GSM8K it improves accuracy from 95.60% to 96.89% without breaking correct answers.
# Guarded Repair for Harm-Aware Post-hoc Replacement of LLM Mathematical Reasoning
Source: [https://arxiv.org/html/2605.24613](https://arxiv.org/html/2605.24613)
###### Abstract
Post\-hoc repair of LLM mathematical reasoning introduces an asymmetric risk: fixing an incorrect reasoning trace is useful, but replacing a trace that was already correct can be harmful\. We study this problem under a selective replacement setting, where a system must decide whether a repaired candidate is safer than preserving the original cached trace\.
We present GuardedRepair, a guarded best-of-$N$ repair framework that diagnoses cached reasoning traces, selectively triggers repair, and accepts answer-changing candidates only when deterministic verification guards support replacement. The framework combines lightweight symbolic checks, surface semantic-risk diagnostics, bounded candidate generation, and conservative acceptance policies.
On the full GSM8K test set, where the initial reasoner already achieves 95.60% accuracy, GuardedRepair improves final accuracy to 96.89%, fixing 17 of 58 remaining errors without measured broken-correct cases in the main run. On a weak-reasoner ASDiv setting, accuracy improves from 78.40% to 87.60%. Direct regeneration baselines show that this gain is not explained by stronger-model re-solving alone: re-solving all GSM8K examples lowers accuracy to 93.03% and breaks 47 initially correct answers. Additional analyses show that guarded repair substantially improves the fixed/broken tradeoff, while also revealing that replacement risk is reduced rather than eliminated.
These results support viewing post\-hoc repair as harm\-aware selective replacement rather than unconstrained re\-solving\.
Haizhou Xia (haizhoux@outlook.com)
## 1 Introduction
Large language models \(LLMs\) are strong mathematical word\-problem solvers, but their reasoning traces can still contain arithmetic slips, missing constraints, semantic misbindings, or malformed final answers\. These errors are difficult to handle by final\-answer evaluation alone: a trace can be fluent while solving the wrong problem, and a repair model can confidently replace an initially correct answer with an incorrect one\.
We study *post-hoc repair*: an initial LLM has already produced a cached reasoning trace and final answer, and the system must decide whether to preserve or replace that trace. This setting differs from ordinary re-solving, self-correction, or best-of-$N$ selection because the original trace may already be correct. The central risk is therefore not only whether a new candidate is plausible, but whether accepting it is safer than keeping the original. We formulate this as harm-aware selective replacement: maximize fixed errors while controlling broken-correct cases.
GuardedRepair instantiates this view as a guarded best-of-$N$ replacement protocol. The individual ingredients (arithmetic checking, surface semantic heuristics, multiple candidates, and deterministic filters) are not claimed to be new on their own. The contribution is to make the replacement decision explicit. The trigger controls when additional compute is spent and how many repair opportunities are exposed; the candidate generator provides a bounded set of possible replacements; the acceptance guards decide whether an answer-changing candidate is safer than the cached trace; and the evaluation protocol reports both fixed and broken transitions. This decomposition matters because a system can improve final accuracy while still causing unacceptable replacement harm. The goal is therefore not to introduce a new verifier, but to study how post-hoc repair should be accepted and evaluated when the original trace may already be correct.
On full GSM8K, where the initial deepseek-v4-flash reasoner leaves only 58 errors, GuardedRepair fixes 17 of them and improves accuracy from 95.60% to 96.89% in the main run. On a seed-42 numeric ASDiv subset with qwen2.5:1.5b as the initial reasoner, it improves accuracy from 78.40% to 87.60%, fixing 92 errors. Additional ASDiv seeds, GSM8K repair-stage reruns, weak-reasoner checks on SVAMP and MultiArith, and local Qwen repair-model checks show consistent positive net gains, while also revealing rare broken-correct cases and lower repair recall for weaker repair models. These results support the intended claim: guarded repair substantially reduces replacement risk relative to direct regeneration, but does not eliminate that risk universally and remains sensitive to candidate quality.
Our contributions are:
- We identify replacement risk as a central failure mode in post-hoc mathematical-reasoning repair: a repair system must be evaluated not only by what it fixes, but also by what it breaks.
- We formalize repair as harm-aware selective replacement over cached traces, separating the trigger, candidate generator, acceptance guard, and fixed/broken evaluation roles.
- We instantiate this formulation with a guarded best-of-$N$ protocol and evaluate it with fixed/broken accounting, accepted-repair precision, candidate-flow analysis, and compute-aware direct-regeneration baselines.
- We show on GSM8K, numeric ASDiv, and weak-reasoner robustness settings that guarded repair improves the fixed/broken tradeoff over direct regeneration, while clarifying the remaining limits of repair safety and candidate quality.
## 2 Related Work
#### Reasoning and verification\.
Chain-of-thought and decomposition prompts improve mathematical reasoning by encouraging intermediate steps (Wei et al., [2022](https://arxiv.org/html/2605.24613#bib.bib1); Zhou et al., [2023](https://arxiv.org/html/2605.24613#bib.bib2)). Verifier and process-supervision methods score candidate solutions or intermediate steps (Cobbe et al., [2021](https://arxiv.org/html/2605.24613#bib.bib3); Lightman et al., [2023](https://arxiv.org/html/2605.24613#bib.bib4)); recent process-reward work extends this with learned process-level evaluators (She et al., [2025](https://arxiv.org/html/2605.24613#bib.bib12)). GuardedRepair shares the goal of checking reasoning, but performs post-hoc replacement of an existing cached trace and uses deterministic diagnostics rather than a learned verifier.
#### Self\-correction, test\-time scaling, and tools\.
Self-refinement and self-correction ask models to critique or revise their own outputs (Madaan et al., [2023](https://arxiv.org/html/2605.24613#bib.bib5); Wu et al., [2025](https://arxiv.org/html/2605.24613#bib.bib13); Xiong et al., [2025](https://arxiv.org/html/2605.24613#bib.bib14); Zhang et al., [2025](https://arxiv.org/html/2605.24613#bib.bib15)). Test-time scaling and best-of-$N$ methods spend additional inference to sample, rank, or revise solutions. Tool-augmented methods such as ReAct, PAL, and program-of-thought prompting offload computation to external actions or executable programs (Yao et al., [2023](https://arxiv.org/html/2605.24613#bib.bib6); Gao et al., [2023](https://arxiv.org/html/2605.24613#bib.bib7); Chen et al., [2023](https://arxiv.org/html/2605.24613#bib.bib8)). In contrast, GuardedRepair does not re-solve every problem or convert it into a program; it repairs cached natural-language traces only when replacement appears safer than preservation.
#### Selective prediction and math benchmarks\.
Selective prediction studies when a system should abstain under uncertainty (El-Yaniv and Wiener, [2010](https://arxiv.org/html/2605.24613#bib.bib16); Geifman and El-Yaniv, [2017](https://arxiv.org/html/2605.24613#bib.bib17)). Our action is different: the system already has an answer and must decide whether to replace it. Prior math word-problem work emphasizes the need to model quantities and relations (Hosseini et al., [2014](https://arxiv.org/html/2605.24613#bib.bib9); Patel et al., [2021](https://arxiv.org/html/2605.24613#bib.bib10)); ASDiv provides a diverse arithmetic corpus (Miao et al., [2020](https://arxiv.org/html/2605.24613#bib.bib11)). We evaluate on full GSM8K and numeric ASDiv, and Appendix [A.1](https://arxiv.org/html/2605.24613#A1.SS1) summarizes how GuardedRepair differs from adjacent paradigms. Adjacent paradigms mainly decide which newly generated solution to trust, how to revise a solution through model feedback, or when to abstain. GuardedRepair instead starts from an already cached trace and asks a different question: whether an answer-changing repair is safer than preserving the original trace. This makes broken-correct accounting central rather than auxiliary.
## 3 Method
[Figure 1: flowchart with three stages. Diagnosis: the problem $x$ and initial reasoning $r_0$ pass through multi-level diagnostics (symbolic checking, constraint coverage, surface risk graph) and a "Repair needed?" decision; if no, the original reasoning is kept. Repair generation: a deterministic diagnostic hint drives hint-guided best-of-$N$ repair, producing candidates $r_c^{(1)}, r_c^{(2)}, \ldots, r_c^{(N)}$. Guarded selection: candidate verification (output cleanliness, meta-consistency, graph guard, equation support) feeds an "Accept candidate?" decision; if no, the original reasoning is kept, and if yes, the repaired reasoning is used. The outputs are the final reasoning $r_f$ and final answer $a_f$.]
Figure 1: Guarded best-of-$N$ post-hoc repair. The default action is to keep the cached trace; replacement occurs only when a triggered repair candidate passes deterministic guards.
### 3.1 Problem Formulation
Given a problem $x$, cached initial reasoning $r_0$, and answer $a_0$, the system outputs final reasoning $r_f$ and answer $a_f$. Gold answers are used only for evaluation. Let $F$ be the number of initially wrong examples fixed by repair and $B$ the number of initially correct examples broken by repair. We view post-hoc repair as constrained replacement:
$$\max_{\pi} F(\pi) \quad \text{s.t.} \quad B(\pi) \leq \epsilon. \tag{1}$$
The default action is to keep the original trace. A replacement is made only when a candidate passes the guarded acceptance policy:
$$r_f = \begin{cases} r_c, & \text{if a candidate repair passes all gates}, \\ r_0, & \text{otherwise}. \end{cases} \tag{2}$$
This framing decomposes repair into distinct decision components. The trigger $T(x, r_0)$ determines whether additional compute is spent and therefore controls candidate recall and cost. The generator $G_N(x, r_0)$ constructs a bounded candidate set of size at most $N$. The acceptance rule $A(x, r_0, r_c)$ implements the harm constraint by allowing answer-changing replacements only when deterministic evidence supports the candidate over the cached trace. Finally, the evaluation protocol reports both $F$ and $B$, because higher final accuracy can be misleading if it is obtained by breaking many initially correct traces.
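The decomposition into trigger, generator, and acceptance rule can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the released implementation: `Trace`, `guarded_replace`, and the three callables are hypothetical names, and returning the first candidate that passes the guard is an assumption, since the rule for choosing among several passing candidates is not specified in this section.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Trace:
    reasoning: str
    answer: str


def guarded_replace(
    problem: str,
    cached: Trace,
    trigger: Callable[[str, Trace], bool],
    generate: Callable[[str, Trace, int], List[Trace]],
    accept: Callable[[str, Trace, Trace], bool],
    n: int = 3,
) -> Trace:
    """Return r_f: the cached trace unless a candidate passes the guard."""
    if not trigger(problem, cached):
        return cached  # default action: keep r_0 and spend no repair compute
    for candidate in generate(problem, cached, n)[:n]:  # bounded candidate set
        if accept(problem, cached, candidate):
            return candidate  # assumption: the first passing candidate wins
    return cached  # every candidate was rejected, so r_0 is preserved
```

Keeping the cached trace on every non-accepting path makes preservation the default, which is the property Eq. (2) expresses.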
### 3.2 Diagnostics and Triggering
GuardedRepair is a replacement-risk decision protocol rather than a standalone verifier, parser, or decoding algorithm. It instantiates the decomposition above with lightweight deterministic diagnostics, bounded best-of-$N$ candidate generation, and conservative acceptance guards. The guards are intentionally simple and auditable: their role is not to solve semantic parsing or verification as standalone tasks, but to operationalize the replacement decision under an explicit harm constraint. The diagnostics include arithmetic equation checking, numeric constraint coverage, meta-consistency scoring, and surface semantic-risk graph signals. The surface semantic-risk graph is a conservative risk feature rather than a semantic parser: it extracts quantity mentions and surface relation patterns such as aggregation, comparison, rate, change-event, and part-whole relations, then emits risk categories such as quantity binding, comparison warnings, answer-format warnings, or per-entity rate omissions. Details, thresholds, risk categories, and a manual audit of graph-risk signals are in Appendices [B.2](https://arxiv.org/html/2605.24613#A2.SS2), [B.4](https://arxiv.org/html/2605.24613#A2.SS4), [B.5](https://arxiv.org/html/2605.24613#A2.SS5), [B.7](https://arxiv.org/html/2605.24613#A2.SS7), and [B.17](https://arxiv.org/html/2605.24613#A2.SS17).
Repair is attempted only when diagnostics indicate risk, including empty generations, arithmetic failures, high\-risk semantic issues, severe missing\-constraint signals, or low meta\-consistency\. Otherwise, the cached trace is preserved without calling the repair model\.
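As an illustration of one trigger signal, the arithmetic equation check can be approximated by scanning a trace for statements of the form "expression = number" and recomputing the left-hand side exactly. The sketch below is an assumption about how such a check might look, not the paper's diagnostic: the regular expression, the helper names, and the restriction to unsigned operands without thousands separators or parentheses are all illustrative choices.

```python
import ast
import operator
import re
from fractions import Fraction
from typing import List, Tuple

_NUM = r"\d+(?:\.\d+)?"
# Matches "a op b [op c ...] = result", e.g. "16 - 3 - 4 = 9".
_EQUATION = re.compile(
    rf"(?<![\d.])((?:{_NUM}\s*[+\-*/]\s*)+{_NUM})\s*=\s*(-?{_NUM})"
)
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def _evaluate(node: ast.AST) -> Fraction:
    """Evaluate a parsed +, -, *, / expression exactly, without eval()."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return Fraction(str(node.value))
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    raise ValueError("unsupported expression")


def failed_equations(trace: str) -> List[Tuple[str, str]]:
    """Return (left-hand side, claimed result) pairs that do not check out."""
    failures = []
    for lhs, claimed in _EQUATION.findall(trace):
        try:
            ok = _evaluate(ast.parse(lhs, mode="eval")) == Fraction(claimed)
        except (SyntaxError, ValueError, ZeroDivisionError):
            ok = False  # unparseable or undefined arithmetic counts as a failure
        if not ok:
            failures.append((lhs, claimed))
    return failures
```

A non-empty result would be one reason to trigger repair; a trace with no detectable equations yields an empty list, so this signal alone cannot certify a trace as correct.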
### 3.3 Best-of-$N$ Repair and Guarded Acceptance
For each triggered case, the repair model generates up to $N$ JSON-formatted candidates using different prompt strategies: hint-guided repair, strict concise arithmetic, and solving from the original problem while treating the initial trace as a warning signal. The final experiments use $N=3$. Each candidate is normalized, re-diagnosed, and evaluated by deterministic gates for output cleanliness, semantic-graph risk, meta-consistency, and equation support. The policy does not use a learned verifier or LLM judge. In relaxed-support settings, the final answer must be explicitly supported by an arithmetic or number-theoretic derivation. Pseudocode, prompts, cleanliness checks, equation-support grammar, and ablation switches are in Appendices [B.9](https://arxiv.org/html/2605.24613#A2.SS9), [B.10](https://arxiv.org/html/2605.24613#A2.SS10), [B.11](https://arxiv.org/html/2605.24613#A2.SS11), [B.12](https://arxiv.org/html/2605.24613#A2.SS12), [B.16](https://arxiv.org/html/2605.24613#A2.SS16), and [B.18](https://arxiv.org/html/2605.24613#A2.SS18).
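The acceptance policy is described as a conjunction of deterministic gates. A minimal sketch of that structure follows, assuming toy versions of only two gates (output cleanliness and equation support); the semantic-graph and meta-consistency gates are omitted, and all names and regular expressions are hypothetical rather than taken from the released code.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Candidate:
    reasoning: str
    answer: str


Gate = Callable[[Candidate], bool]


def output_clean(c: Candidate) -> bool:
    # Toy cleanliness gate: the final answer must be a bare number.
    return re.fullmatch(r"-?\d+(?:\.\d+)?", c.answer) is not None


def equation_support(c: Candidate) -> bool:
    # Toy support gate: the answer must appear as the result of "... = answer".
    pattern = r"=\s*" + re.escape(c.answer) + r"(?!\.?\d)"
    return re.search(pattern, c.reasoning) is not None


GATES: Dict[str, Gate] = {
    "output_clean": output_clean,
    "equation_support": equation_support,
}


def evaluate_gates(c: Candidate, gates: Dict[str, Gate] = GATES) -> Tuple[bool, List[str]]:
    """Accept only if every gate passes; also report which gates failed."""
    failed = [name for name, gate in gates.items() if not gate(c)]
    return (not failed, failed)
```

Returning the names of the failing gates, rather than a bare boolean, is a design choice that would make rejected candidates auditable in logs.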
## 4 Experimental Setup
#### Datasets and models\.
The primary evaluation uses the full GSM8K test split with 1,319 samples. The weak-reasoner evaluation uses a uniformly sampled 1,000-example numeric ASDiv subset drawn with seed 42 from a 2,147-example numeric pool. We sample to control repair-stage cost; the subset is drawn before any repair experiment and is not selected by model performance. To assess sampling sensitivity, Appendix [B.14](https://arxiv.org/html/2605.24613#A2.SS14) reports three additional 1,000-example seeds from the same pool. The filtering and sampling procedure is given in Appendix [B.3](https://arxiv.org/html/2605.24613#A2.SS3). GSM8K initial traces are generated by deepseek-v4-flash, while the main repair candidates use deepseek-v4-pro. In weak-reasoner settings, the initial reasoner is qwen2.5:1.5b. We also run local qwen2.5:7b and qwen2.5:14b repair-model portability checks on the ASDiv seed-42 setting while keeping the cached initial traces and acceptance policy fixed. Repair calls use temperature 0.0, JSON output, a 768-token budget, and one 512-token format retry. Model and decoding details are in Appendix [B.1](https://arxiv.org/html/2605.24613#A2.SS1).
#### Metrics\.
We report initial/final accuracy, absolute improvement, fixed errors, broken\-correct cases, accepted\-repair precision, error repair rate, and harm rate:
$$\mathrm{HarmRate} = \frac{\#\mathrm{BrokenCorrect}}{\#\mathrm{Total}}. \tag{3}$$
Accepted repair precision is the fraction of accepted repairs that are wrong-to-correct transitions. This is conservative: accepted wrong-to-wrong modifications lower precision even if they do not break correct answers. We also report exact paired sign-test probabilities for changed examples and use “zero measured harm” only descriptively; with zero observed broken-correct cases, the rule-of-three gives approximate 95% upper bounds of 0.23% for GSM8K and 0.30% for the 1,000-sample ASDiv setting.
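These metrics reduce to simple counts. The sketch below computes them from the GSM8K figures reported in this paper (1,319 examples, 58 initial errors, 17 fixed, 0 broken) and applies the rule of three as $3/n$. Function names are hypothetical, and the error repair rate is assumed to be fixed errors divided by initially wrong examples, which matches the 17-of-58 figure quoted in the text.

```python
from typing import Dict


def repair_metrics(total: int, initial_correct: int, fixed: int, broken: int) -> Dict[str, float]:
    """Accuracy and harm metrics, all expressed as percentages."""
    final_correct = initial_correct + fixed - broken
    initial_wrong = total - initial_correct
    return {
        "initial_acc": 100.0 * initial_correct / total,
        "final_acc": 100.0 * final_correct / total,
        "harm_rate": 100.0 * broken / total,  # Eq. (3)
        "error_repair_rate": 100.0 * fixed / initial_wrong,  # assumed definition
    }


def accepted_precision(wrong_to_correct: int, accepted: int) -> float:
    """Share of accepted repairs that are wrong-to-correct transitions."""
    return 100.0 * wrong_to_correct / accepted


def rule_of_three_upper_bound(n: int) -> float:
    """Approximate 95% upper bound (in %) on a rate with zero observed events."""
    return 100.0 * 3 / n


gsm8k = repair_metrics(total=1319, initial_correct=1261, fixed=17, broken=0)
print({name: round(value, 2) for name, value in gsm8k.items()})
print(round(rule_of_three_upper_bound(1319), 2), round(rule_of_three_upper_bound(1000), 2))
```

With these inputs the accuracies round to 95.60 and 96.89, and the two bounds round to 0.23 and 0.30, in agreement with the values stated above.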
## 5 Results
### 5.1 Main Results
Table [1](https://arxiv.org/html/2605.24613#S5.T1) shows the main results. On full GSM8K, GuardedRepair improves accuracy from 95.60% to 96.89%, fixing 17 of 58 initially incorrect cases with zero measured broken-correct cases in the main run. Although the absolute GSM8K gain is 1.29 points, this setting starts from a high baseline; the result corresponds to correcting 29.3% of the remaining initial errors without measured harm in the main run. On the weak-reasoner ASDiv setting, it improves accuracy from 78.40% to 87.60%, fixing 92 errors with zero measured broken-correct cases. The paired sign-test probabilities for changed examples are $1.53 \times 10^{-5}$ on GSM8K and $4.04 \times 10^{-28}$ on ASDiv.
Table 1: Main results. Accuracy and improvement are reported in percentage points. In the high-baseline GSM8K setting, GuardedRepair fixes 17 of 58 remaining errors; in the weak-reasoner ASDiv setting, it fixes 92 errors. Both main runs introduce zero measured broken-correct cases. Under the rule of three, zero observed harm implies approximate 95% upper bounds of 0.23% on GSM8K and 0.30% on ASDiv numeric-1000.
The larger ASDiv gain reflects greater repairable error mass: GSM8K has 58 initially wrong cases, while the weak ASDiv setting has 216. The ASDiv result is not unique to seed 42: across four uniformly sampled 1,000-example seeds, the mean improvement is +9.05 points, with gains ranging from +8.50 to +9.80 points. We report seed-level mean, standard deviation, and range rather than treating the seed-42 subset as a full-pool estimate. Additional checks in the appendix support the robustness and boundaries of this result: GSM8K repair-stage reruns remain above the initial accuracy but show rare broken-correct cases (Appendix [B.15](https://arxiv.org/html/2605.24613#A2.SS15)); additional ASDiv seeds also reveal rare harm (Appendix [B.14](https://arxiv.org/html/2605.24613#A2.SS14)); and SVAMP/MultiArith weak-reasoner checks improve final accuracy while each introducing one broken-correct case (Appendix [A.3](https://arxiv.org/html/2605.24613#A1.SS3)). Repair behavior and accepted-outcome accounting are in Appendices [A.2](https://arxiv.org/html/2605.24613#A1.SS2) and [B.13](https://arxiv.org/html/2605.24613#A2.SS13).
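The paired sign-test probabilities reported in Section 5.1 can be checked with a few lines of exact arithmetic. The test variant is not spelled out in this section, so the sketch below assumes a two-sided exact sign test over changed examples, with fixed cases counted as successes and broken cases as failures; under that assumption it reproduces the reported values for 17 fixed / 0 broken and 92 fixed / 0 broken. The function name is hypothetical.

```python
from math import comb


def sign_test_two_sided(fixed: int, broken: int) -> float:
    """Exact two-sided sign test p-value under a fair-coin null hypothesis."""
    n = fixed + broken
    k = min(fixed, broken)
    # Probability of a split at least as lopsided as the observed one, one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


print(f"{sign_test_two_sided(17, 0):.2e}")  # GSM8K main run
print(f"{sign_test_two_sided(92, 0):.2e}")  # ASDiv seed-42 run
```

Unchanged examples do not enter the test, so the p-value reflects only the direction of the changes, not how many examples were left alone.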
### 5.2 Ablations
Table [2](https://arxiv.org/html/2605.24613#S5.T2) reports GSM8K ablations. Increasing from $N=1$ to $N=3$ improves repair recall, but fixed/broken accounting is non-monotonic: $N=2$ fixes more than $N=1$ but introduces two broken-correct cases. Removing the semantic graph gate reduces final accuracy and fixes fewer errors, while disabling equation-support verification introduces two broken-correct cases and lowers accuracy. Relaxed missing-constraint acceptance reaches the same final accuracy as the main system but accepts more repairs without fixing more errors, so we keep the stricter policy. These results form a discrete fix–harm frontier: the selected configuration is not the one that accepts the most repairs, but the one with the best observed fixed/broken tradeoff.
Table 2: Ablation results on the full GSM8K test split. Accuracy and improvement are reported in percentage points. The final system uses guarded $N=3$ repair with surface graph diagnostics and equation-support verification.
### 5.3 Direct Strong-Model Baselines
Table [3](https://arxiv.org/html/2605.24613#S5.T3) tests whether gains come merely from invoking the stronger repair model. Directly re-solving all GSM8K examples with the strong model lowers accuracy to 93.03% and breaks 47 initially correct solutions. Re-solving only triggered examples reduces cost but still breaks 7 correct cases. The closest compute-controlled baseline, direct best-of-3 solving with the same triggered set, strong model, number of calls, and gates but without the initial trace or diagnostic hint, reaches 96.29%, fixes 12 errors, and breaks 3 correct cases. Under the same 1,498-call budget, GuardedRepair reaches 96.89%, fixes 17 errors, and introduces zero measured broken-correct cases.
Table 3: Compute-aware strong-model baselines on the full GSM8K test split. Relative calls are normalized by one full direct solve pass over all 1,319 examples. Direct strong-model regeneration either hurts accuracy or introduces broken-correct cases. Under the same 1,498-call budget as direct best-of-3 with gates, GuardedRepair fixes more errors and introduces no measured broken-correct cases.
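The comparison in Section 5.3 can be read through the constrained objective in Eq. (1): among configurations whose broken count stays within a budget $\epsilon$, prefer the one that fixes the most errors. The sketch below applies that rule to the two configurations for which both counts are quoted in the text above. The class and function names are hypothetical, and treating $\epsilon$ as an integer count is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PolicyResult:
    name: str
    fixed: int   # F: initially wrong examples that become correct
    broken: int  # B: initially correct examples that become wrong


def select_policy(results: List[PolicyResult], epsilon: int) -> Optional[PolicyResult]:
    """Maximize F subject to B <= epsilon; None if no policy is feasible."""
    feasible = [r for r in results if r.broken <= epsilon]
    return max(feasible, key=lambda r: r.fixed) if feasible else None


def final_accuracy(initial_correct: int, total: int, r: PolicyResult) -> float:
    """Final accuracy (%) implied by fixed/broken accounting."""
    return 100.0 * (initial_correct + r.fixed - r.broken) / total


RESULTS = [
    PolicyResult("direct best-of-3 with gates", fixed=12, broken=3),
    PolicyResult("GuardedRepair", fixed=17, broken=0),
]

for r in RESULTS:
    print(r.name, round(final_accuracy(1261, 1319, r), 2))
print(select_policy(RESULTS, epsilon=0).name)
```

The accounting identity (final correct = initial correct + fixed - broken) recovers the 96.29% and 96.89% figures from the counts, and GuardedRepair is selected at $\epsilon = 0$ as well as at $\epsilon = 3$, since it fixes more errors in both cases.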
### 5.4 Open-Source Repair-Model Portability
To test whether the guarded replacement protocol depends entirely on the proprietary API repair model, we replace the repair model in the ASDiv seed-42 setting with local Qwen2.5 repair models while keeping the same cached qwen2.5:1.5b initial traces and deterministic acceptance policy. Table [4](https://arxiv.org/html/2605.24613#S5.T4) shows that both local repair models preserve positive gains without measured broken-correct cases, but their gains are smaller than those obtained with deepseek-v4-pro. This suggests that the protocol transfers beyond a single API repair model, while repair recall remains strongly dependent on candidate quality.
Table 4: Open-source repair-model portability check on the ASDiv seed-42 setting. DeepSeek denotes deepseek-v4-pro; Qwen-7B and Qwen-14B denote local qwen2.5:7b and qwen2.5:14b. The initial reasoner is fixed to qwen2.5:1.5b, cached initial traces are unchanged, and only the repair model is replaced. Local Qwen repair models yield smaller but positive gains without measured broken-correct cases.
## 6 Analysis
#### Why guarding is necessary\.
The ablations show that more candidate generation is not enough: $N=2$ increases fixes but introduces broken-correct cases, and disabling equation support accepts more repairs while lowering final accuracy. A typical unsafe candidate copies or derives a plausible number without explicitly supporting the final answer, or omits a late constraint while producing a clean-looking derivation. The guarded policy therefore requires a candidate to be safer than preserving the initial trace, not merely plausible.
#### Candidate flow and remaining errors\.
Candidate-flow logs separate generation failures from conservative rejections. On GSM8K, 25 initially wrong examples obtain at least one correct candidate and 17 are accepted; on ASDiv, 130 obtain a correct candidate and 92 are accepted. The remaining errors come from both candidate-generation limits and guarded-acceptance false rejections. This explains the main tradeoff: stricter gates reduce harm but reject some correct candidates. Appendix [A.4](https://arxiv.org/html/2605.24613#A1.SS4) reports the full flow table, and Appendix [A.5](https://arxiv.org/html/2605.24613#A1.SS5) provides representative fixed-error, rejected-unsafe-repair, and false-rejection cases.
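The candidate-flow counts split the remaining errors into two groups by subtraction. The sketch below is a worked arithmetic example using only the counts quoted in this paragraph and the initially wrong totals from Section 5.1 (58 on GSM8K, 216 on ASDiv). The function and key names are hypothetical, and the "no correct candidate" bucket is assumed to include wrong examples that were never triggered.

```python
from typing import Dict


def candidate_flow(initially_wrong: int, with_correct_candidate: int, accepted_correct: int) -> Dict[str, float]:
    """Split initially wrong examples by where the repair pipeline stopped."""
    return {
        # No correct candidate was ever produced (includes untriggered cases).
        "no_correct_candidate": initially_wrong - with_correct_candidate,
        # A correct candidate existed but the guards rejected it.
        "rejected_correct": with_correct_candidate - accepted_correct,
        "accepted_correct": accepted_correct,
        # Share of examples with a correct candidate that the guards accepted.
        "guard_recall_pct": 100.0 * accepted_correct / with_correct_candidate,
    }


print(candidate_flow(58, 25, 17))    # GSM8K
print(candidate_flow(216, 130, 92))  # ASDiv seed 42
```

Under these assumptions, GSM8K has 33 errors with no correct candidate and 8 false rejections, while ASDiv has 86 and 38, so in both settings candidate generation accounts for more of the remaining errors than guard rejections do.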
#### Cost\.
Best-of-3 uses 1,498 repair attempts on GSM8K and 1,859 on ASDiv because repair is triggered only for suspicious traces rather than for every example. This corresponds to 1.14 strong-model calls per GSM8K example and 1.86 calls per ASDiv example. The trigger rate is 38.5% on GSM8K (508/1,319) and 67.3% on ASDiv (673/1,000), while the accepted-repair rate among triggered examples is 3.94% and 14.1%, respectively. Appendix [A.7](https://arxiv.org/html/2605.24613#A1.SS7) discusses how $N$ controls the recall–cost tradeoff.
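The cost figures follow directly from the call and trigger counts. The sketch below recomputes them; the function and key names are hypothetical, and the calls-per-triggered-example ratio is a derived quantity that is not reported in the text.

```python
from typing import Dict


def cost_summary(total: int, triggered: int, repair_calls: int) -> Dict[str, float]:
    """Trigger rate and repair-call cost from raw counts."""
    return {
        "trigger_rate_pct": 100.0 * triggered / total,
        "calls_per_example": repair_calls / total,
        # Derived here for illustration; not a number quoted in the paper.
        "calls_per_triggered": repair_calls / triggered,
    }


gsm8k = cost_summary(total=1319, triggered=508, repair_calls=1498)
asdiv = cost_summary(total=1000, triggered=673, repair_calls=1859)
print({name: round(value, 2) for name, value in gsm8k.items()})
print({name: round(value, 2) for name, value in asdiv.items()})
```

The derived calls-per-triggered ratio comes out slightly below 3 in both settings (about 2.95 and 2.76), which is consistent with generating up to $N=3$ candidates per triggered case rather than always exactly three.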
## 7 Limitations and Threats to Validity
The study is limited to arithmetic word problems with mostly numeric final answers; geometry, algebra proofs, program synthesis, and open\-ended reasoning may require different normalization and diagnostics\. The surface semantic\-risk graph checker is heuristic rather than a complete parser: it is designed as a conservative risk\-control feature for quantity binding, comparison, answer\-format, and related risks, not as a high\-precision semantic\-understanding module\. It can miss subtle errors and frequently emits benign warnings\. We therefore evaluate it through downstream ablations, candidate\-flow analysis, qualitative examples, diagnostic logs, and a small stratified manual audit rather than as a standalone parser\.
Most non-main configurations use a single run, and API-based repair generation may vary even at temperature 0.0. For the main GSM8K setting, Appendix [B.15](https://arxiv.org/html/2605.24613#A2.SS15) reports three additional repair-stage reruns with fixed cached initial traces; all remain positive relative to the 95.60% initial accuracy, but rare broken-correct cases appear. Thus zero measured harm in the main GSM8K and seed-42 ASDiv runs should not be interpreted as proof that the system cannot break correct answers. Additional SVAMP, MultiArith, ASDiv multi-seed, and GSM8K rerun checks show that the current guards reduce but do not eliminate replacement risk. The local Qwen repair-model check further shows that the protocol is not tied to a single API repair model, but also that repair recall is model-dependent.
Finally, the main weak\-reasoner ASDiv result uses a 1,000\-example sample from a 2,147\-example numeric pool to control repair\-stage cost\. We mitigate sampling concerns by reporting three additional uniformly sampled seeds, which show consistent positive gains, but this is still not a full\-pool evaluation\. The multi\-seed results should be interpreted as a sampling\-sensitivity check rather than a replacement for evaluating all 2,147 numeric examples\. The numeric filtering also excludes yes/no, comparison, and other non\-numeric answer types, leaving broader answer formats for future work\.
## 8 Reproducibility Statement
We release evaluation code, answer-normalization scripts, diagnostic rules, guarded acceptance policy, prompts, thresholds, dataset split files, cached initial traces, repair candidate logs, diagnostic outputs, and final prediction files at [https://github.com/Haizhoux0517/guarded-repair](https://github.com/Haizhoux0517/guarded-repair). These artifacts allow repair-stage analysis to be reproduced without regenerating initial model outputs. For ASDiv numeric-1000, we will release full numeric-pool IDs, sampled IDs for all reported seeds, filtering scripts, sampling seeds, and rejected-category counts, so that the pre-run uniform sampling procedure can be audited. We also release the 50-case stratified semantic-graph audit file, its manual annotations, and the corresponding summary statistics under artifacts/semantic_graph_audit/gsm8k_main/. We also report local Qwen repair-model runs to provide a reproducible non-API portability check. API model names, local model names, decoding parameters, token budgets, JSON schema, and retry policy are reported in Appendices [B.1](https://arxiv.org/html/2605.24613#A2.SS1), [B.10](https://arxiv.org/html/2605.24613#A2.SS10), and [B.17](https://arxiv.org/html/2605.24613#A2.SS17).
## 9 Conclusion
We presented GuardedRepair as a harm-aware selective replacement protocol for post-hoc repair of LLM mathematical reasoning traces. The system diagnoses cached traces, repairs only risky cases, and accepts replacements only when deterministic guards support changing the original answer. Across GSM8K, numeric ASDiv, and weak-reasoner robustness checks, the method improves final accuracy and substantially reduces harm relative to direct strong-model regeneration. Multi-seed, rerun, robustness, and local repair-model experiments also clarify the boundary: guarded repair reduces replacement risk, but rare broken-correct cases can still occur and repair recall depends on candidate quality. These findings support treating post-hoc repair as selective replacement: fixing wrong traces matters, but a repair system should also be judged by whether it preserves traces that were already correct.
## References
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In *Advances in Neural Information Processing Systems*.
- Zhou et al. (2023) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V. Le, and Ed H. Chi. 2023. Least-to-most prompting enables complex reasoning in large language models. In *International Conference on Learning Representations*.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. *arXiv preprint arXiv:2110.14168*.
- Lightman et al. (2023) Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let’s verify step by step. *arXiv preprint arXiv:2305.20050*.
- Madaan et al. (2023) Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-Refine: Iterative refinement with self-feedback. In *Advances in Neural Information Processing Systems*.
- Yao et al. (2023) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing reasoning and acting in language models. In *International Conference on Learning Representations*.
- Gao et al. (2023) Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. PAL: Program-aided language models. In *Proceedings of the 40th International Conference on Machine Learning*.
- Chen et al. (2023) Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. 2023. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. *Transactions on Machine Learning Research*.
- Hosseini et al. (2014) Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. 2014. Learning to solve arithmetic word problems with verb categorization. In *Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing*.
- Patel et al. (2021) Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP models really able to solve simple math word problems? In *Proceedings of NAACL-HLT*.
- Miao et al. (2020) Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2020. A diverse corpus for evaluating and developing English math word problem solvers. In *Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics*.
- She et al. (2025) Shuaijie She, Junxiao Liu, Yifeng Liu, Jiajun Chen, Xin Huang, and Shujian Huang. 2025. R-PRM: Reasoning-driven process reward modeling. In *Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*.
- Wu et al. (2025) Zhenyu Wu, Qingkai Zeng, Zhihan Zhang, Zhaoxuan Tan, Chao Shen, and Meng Jiang. 2025. Enhancing mathematical reasoning in LLMs by stepwise correction. In *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics*.
- Xiong et al. (2025) Wei Xiong, Hanning Zhang, Chenlu Ye, Lichang Chen, Nan Jiang, and Tong Zhang. 2025. Self-rewarding correction for mathematical reasoning. *arXiv preprint arXiv:2502.19613*.
- Zhang et al. (2025) Fuxiang Zhang, Jiacheng Xu, Chaojie Wang, Ce Cui, Yang Liu, and Bo An. 2025. Incentivizing LLMs to self-verify their answers. *arXiv preprint arXiv:2506.01369*.
- El-Yaniv and Wiener (2010) Ran El-Yaniv and Yair Wiener. 2010. On the foundations of noise-free selective classification. *Journal of Machine Learning Research*, 11:1605–1641.
- Geifman and El-Yaniv (2017) Yonatan Geifman and Ran El-Yaniv. 2017. Selective classification for deep neural networks. In *Advances in Neural Information Processing Systems*.
## Appendix A Additional Results and Positioning
### A\.1 Positioning Relative to Prior Paradigms
Table [5](https://arxiv.org/html/2605.24613#A1.T5) summarizes the main distinction from adjacent paradigms\. Unlike methods that re\-solve, rerank, or abstain, GuardedRepair decides whether replacing an existing cached trace is safer than preserving it\.
Table 5: Positioning of GuardedRepair relative to related paradigms\. The key distinction is harm\-aware selective replacement over an existing cached trace\.
### A\.2 Repair Behavior Details
This section gives the full repair\-behavior accounting referenced in Table [1](https://arxiv.org/html/2605.24613#S5.T1)\. The table reports how often repair is triggered, how many candidate replacements are accepted, and how many accepted repairs are true wrong\-to\-correct fixes\.
Table 6: Repair behavior under guarded best\-of\-3\. Accepted precision is the percentage of accepted repairs that are wrong\-to\-correct transitions; accepted repairs that remain wrong lower precision even when they do not create harm\.
### A\.3 Weak\-Reasoner Robustness Results
This section expands the weak\-reasoner robustness summary from the main text\. All rows use `qwen2.5:1.5b` as the initial reasoner and the same guarded best\-of\-3 repair stage\.
Table 7: Weak\-reasoner robustness checks\. The method improves all three weak\-initial settings, with larger gains when more repairable error mass is available\. Harm is the percentage of all examples that become incorrect after initially being correct\.
### A\.4 Candidate Flow Details
This section decomposes the repair pipeline over initially wrong examples\. The counts separate errors that remain because no correct candidate is generated from errors that remain because a correct candidate is rejected by the guarded acceptance policy\.
Table 8: Candidate flow over initially wrong examples\. InitW/TrigW are initially wrong and triggered\-wrong counts; CorrC/AccC/RejC are correct\-candidate, accepted\-correct, and rejected\-correct counts\.
### A\.5 Qualitative Repair Examples and Outcome Taxonomy
Table [9](https://arxiv.org/html/2605.24613#A1.T9) gives representative GSM8K repair outcomes selected from the analysis logs\. The cases illustrate that the guarded policy is not only a final\-answer filter\. It can accept a repair when the candidate restores a missing semantic relation, reject a plausible but unsafe answer change when the original is already correct, reject a correct candidate when residual diagnostic warnings remain, or accept a locally clean candidate that is still wrong\. The last two cases correspond to the recall–precision boundary in Table [8](https://arxiv.org/html/2605.24613#A1.T8): stricter gates reduce broken\-correct cases, but accepted precision is not perfect\.
Table 9: Representative qualitative repair outcomes\. The examples show how guarded acceptance fixes some errors, avoids unsafe replacements, rejects some correct candidates, and occasionally accepts a still\-wrong candidate when local evidence is insufficient\.
### A\.6 Qualitative Outcome Taxonomy
Table [10](https://arxiv.org/html/2605.24613#A1.T10) summarizes the main qualitative outcome classes observed in the analysis logs\. The table is not a separate human annotation study; rather, it aggregates the same fixed/broken accounting, candidate\-flow logs, diagnostic categories, and rejection counters used in the quantitative analysis\. It complements the individual examples in Table [9](https://arxiv.org/html/2605.24613#A1.T9) by showing where the repair pipeline succeeds and where it remains conservative\.
Table 10: Qualitative outcome taxonomy from the analysis logs\. Counts summarize where the repair system fixes errors, where it accepts still\-wrong repairs, where conservative gates reject correct candidates, and where residual harm appears in robustness or direct\-regeneration baselines\.
The taxonomy highlights three recurring patterns\. First, the main gains come from answer\-changing repairs that convert initially wrong examples into correct ones, especially in the weak\-reasoner settings\. Second, some generated correct candidates remain rejected; this is the main source of lost recall and explains why the system does not fix every initially wrong example\. Third, the rare broken\-correct cases in SVAMP, MultiArith, and the direct baseline arise from plausible answer\-changing candidates that pass local checks but reinterpret a quantity relation differently from the gold solution\. These cases support the paper’s main design choice: repair should be treated as guarded selective replacement rather than unconstrained regeneration\.
### A\.7 Cost and Practicality
Best\-of\-3 increases inference cost, but the trigger policy avoids re\-solving every example\. On GSM8K, repair is triggered for 508 of 1,319 examples \(38\.5%\) and makes 1,498 repair attempts, or 1\.14 strong\-model calls per dataset example\. Only 20 triggered examples are accepted \(3\.94% of triggered cases\), reflecting the high baseline and conservative acceptance policy\. On ASDiv, repair is triggered for 673 of 1,000 examples \(67\.3%\) and makes 1,859 attempts, or 1\.86 calls per example; 95 triggered examples are accepted \(14\.1%\), reflecting the larger repairable error mass in the weak\-reasoner setting\.
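The cost figures above follow from four counts per dataset. The sketch below recomputes them from the numbers quoted in this section; the class and method names are illustrative and are not taken from the released code.

```python
from dataclasses import dataclass


@dataclass
class RepairCost:
    """Illustrative container for the counts reported in this section."""
    n_examples: int    # dataset size
    n_triggered: int   # examples for which repair was attempted
    n_attempts: int    # total strong-model repair calls
    n_accepted: int    # triggered examples whose repair was accepted

    def trigger_rate(self) -> float:
        return self.n_triggered / self.n_examples

    def calls_per_example(self) -> float:
        # Amortized strong-model calls over the whole dataset.
        return self.n_attempts / self.n_examples

    def acceptance_rate(self) -> float:
        # Share of triggered examples that end with an accepted repair.
        return self.n_accepted / self.n_triggered


gsm8k = RepairCost(n_examples=1319, n_triggered=508, n_attempts=1498, n_accepted=20)
asdiv = RepairCost(n_examples=1000, n_triggered=673, n_attempts=1859, n_accepted=95)

for name, cost in [("GSM8K", gsm8k), ("ASDiv", asdiv)]:
    print(
        f"{name}: trigger rate {cost.trigger_rate():.1%}, "
        f"{cost.calls_per_example():.2f} calls/example, "
        f"{cost.acceptance_rate():.2%} of triggered accepted"
    )
```

Dividing attempts by the full dataset size, rather than by the triggered count, is what keeps the amortized cost near one to two calls per example even with a budget of three candidates.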
The parameter $N$ controls a recall–cost tradeoff\. Lower $N$ reduces strong\-model calls but can miss correct repairs, while higher $N$ increases candidate recall and also increases the number of candidates that must be rejected by the guards\. In applications where safety and auditability are more important than raw throughput, the selected $N=3$ setting is a conservative operating point; for lower\-cost deployment, $N$ or the trigger threshold can be reduced\.
## Appendix B Implementation Details and Reproducibility
This appendix summarizes the implementation details needed to reproduce the reported experiments\. The main experiments use cached initial reasoning traces, so the repair stage can be rerun without regenerating the initial model outputs\.
### B\.1 Model and Decoding Configuration
Table [11](https://arxiv.org/html/2605.24613#A2.T11) reports the model configuration used in the main experiments\. Because provider\-side snapshots of API\-based models may change over time, exact regeneration is not guaranteed; the reported results therefore rely on cached initial traces and saved repair candidate outputs, and the repair\-stage decoding settings are reported explicitly\. Re\-running the repair\-stage analysis from these cached artifacts does not require regenerating the initial model outputs\.
Table 11: Model configuration\. DS\-Flash and DS\-Pro denote `deepseek-v4-flash` and `deepseek-v4-pro`; Qwen\-1\.5B, Qwen\-7B, and Qwen\-14B denote `qwen2.5:1.5b`, `qwen2.5:7b`, and `qwen2.5:14b`\. The main repair model is used for the primary GSM8K and ASDiv results; local Qwen models are used only for the repair\-model portability check\.
### B\.2 Answer Extraction and Normalization
The final\-answer parser first searches for explicit answer markers such as `Final Answer:`, `final answer is`, and `answer:`\. If such markers are found, it extracts the last answer\-like value in the marked span\. Supported answer forms include integers, decimals, comma\-formatted numbers, fractions, colon\-formatted time or ratio strings, and yes/no answers\. If no explicit marker is found, the parser falls back to the last answer\-like value in the reasoning trace, but the repair pipeline itself requires clean repair candidates to contain exactly one `Final Answer` line\.
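A minimal sketch of the marker\-then\-fallback extraction described above. The regular expressions, the function name `extract_final_answer`, and the choice to treat the "marked span" as the rest of the line after the last marker are all assumptions made for illustration; the released parser may differ.

```python
import re

# Marker spellings follow the examples in the text; the exact patterns are assumed.
MARKER = re.compile(r"final\s*answer\s*(?:is|:)|answer\s*:", re.IGNORECASE)

# Answer-like values, most specific first: yes/no, time or ratio,
# fraction, comma-formatted number, then plain integer or decimal.
VALUE = re.compile(
    r"\b(?:yes|no)\b"
    r"|-?\d+:\d+"
    r"|-?\d+\s*/\s*\d+"
    r"|-?\d{1,3}(?:,\d{3})+(?:\.\d+)?"
    r"|-?\d+(?:\.\d+)?",
    re.IGNORECASE,
)


def extract_final_answer(trace):
    """Return the last answer-like value, preferring an explicitly marked span."""
    markers = list(MARKER.finditer(trace))
    if markers:
        # Assumption: the marked span is the remainder of the marker's line.
        span = trace[markers[-1].end():].split("\n", 1)[0]
    else:
        # Fallback: search the whole reasoning trace.
        span = trace
    values = VALUE.findall(span)
    return values[-1] if values else None


print(extract_final_answer("Step 1: 276/12 = 23\nFinal Answer: 23"))
print(extract_final_answer("So she has 1,059,955 apples"))
```

The fallback is deliberately permissive, which is one reason the repair pipeline separately insists on exactly one `Final Answer` line for candidates.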
For evaluation, answers are normalized by stripping punctuation and unit parentheses, removing thousands separators, normalizing decimal integers such as `12.0` to `12`, and comparing exact normalized strings before applying numeric, fraction, and ratio equivalence\. This normalization supports examples such as `1,059,955` versus `1059955`, `12` versus `12.0`, and `2/3` versus `4/6`\.
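The normalization and equivalence steps can be sketched with exact rational arithmetic from Python's `fractions` module. The function names and the specific punctuation set are assumptions; only the overall order (string match first, then numeric, fraction, and ratio equivalence) comes from the description above.

```python
import re
from fractions import Fraction


def normalize(ans):
    """Normalize an answer string (illustrative rules, not the released code)."""
    s = re.sub(r"\([^)]*\)", "", ans)      # drop unit parentheses, e.g. "23 (trays)"
    s = s.replace(",", "")                 # remove thousands separators
    s = re.sub(r"\s+", "", s)              # remove whitespace
    s = s.lstrip("$").rstrip(".;!?")       # strip surrounding punctuation
    if re.fullmatch(r"-?\d+\.0+", s):      # decimal integer: 12.0 -> 12
        s = s.split(".")[0]
    return s.lower()


def to_fraction(s):
    """Parse integers, decimals, fractions, and a:b ratios as exact rationals."""
    try:
        if ":" in s:
            a, b = s.split(":")
            return Fraction(int(a), int(b))
        return Fraction(s)
    except (ValueError, ZeroDivisionError):
        return None


def answers_match(pred, gold):
    p, g = normalize(pred), normalize(gold)
    if p == g:                             # exact normalized string match first
        return True
    if (":" in p) != (":" in g):           # assumption: ratios only match ratios
        return False
    fp, fg = to_fraction(p), to_fraction(g)
    return fp is not None and fp == fg


print(answers_match("1,059,955", "1059955"), answers_match("2/3", "4/6"))
```

Using `Fraction` rather than floats avoids tolerance choices for cases like `2/3` versus `4/6`. Note that a colon string may be a time rather than a ratio, which is the same ambiguity that leads Section B\.3 to exclude such answers from the ASDiv subset.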
### B\.3 ASDiv Numeric Filtering
The ASDiv numeric subset is constructed from a locally stored ASDiv\-derived JSONL pool of 2,305 examples, converted into the common input format used by the pipeline\. The preprocessing script first normalizes answers while preserving fractions, comma\-formatted numbers, and colon\-formatted time or ratio strings\. We then keep only examples whose normalized answers are plain numeric values: integers, decimals, or fractions\. We exclude yes/no answers, likely categorical question types, and non\-numeric or ambiguous answer formats\. Colon\-formatted answers such as `2:3` or `12:50` are excluded because they may represent either ratios or times\. This yields 2,147 numeric examples, from which we uniformly sample 1,000 examples with seed 42 before running any repair experiment\. The 1,000\-example subset is used to control repair\-stage API cost and is not selected by model performance\. To make the sampling decision auditable, we release the full numeric\-pool IDs, sampled IDs for all reported seeds, filtering scripts, sampling seeds, and rejected\-category counts\.
Table 12: Construction of the numeric ASDiv subset\. The final 1,000\-example evaluation set is uniformly sampled from the numeric pool with seed 42 before repair runs\.
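The filter\-then\-sample procedure can be sketched as below. The record fields (`id`, `answer`), the regular expression, and the use of `random.Random` are assumptions; the released scripts may use a different random number generator, so this sketch would not necessarily reproduce the released sampled IDs even with seed 42.

```python
import random
import re

# Plain numeric answers only: integers, decimals, or simple fractions.
PLAIN_NUMERIC = re.compile(r"-?(?:\d+(?:\.\d+)?|\d+/\d+)")


def is_plain_numeric(answer):
    """True for integers, decimals, and fractions; False for yes/no, a:b, or text."""
    cleaned = answer.replace(",", "").strip()
    return PLAIN_NUMERIC.fullmatch(cleaned) is not None


def build_subset(pool, k=1000, seed=42):
    """Filter to the numeric pool, then sample uniformly without replacement."""
    numeric = [ex for ex in pool if is_plain_numeric(ex["answer"])]
    rng = random.Random(seed)              # seeded before any repair run
    return numeric, rng.sample(numeric, min(k, len(numeric)))


# Toy pool with hypothetical records, for illustration only.
toy_pool = [
    {"id": i, "answer": a}
    for i, a in enumerate(["12", "2/3", "yes", "2:3", "1,200", "12:50", "3.5", "blue"])
]
numeric_pool, sampled = build_subset(toy_pool, k=2, seed=42)
print([ex["answer"] for ex in numeric_pool])
```

Fixing the seed and sampling before repair is what makes the subset independent of model performance, as the paragraph above requires.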
### B\.4 Surface Semantic\-Risk Graph Algorithm
The surface semantic\-risk graph checker is implemented with deterministic surface\-pattern rules rather than an LLM, a learned verifier, or a dependency parser\. The checker first extracts numeric mentions from the problem and reasoning trace after normalizing number words, decimals, fractions, commas, and simple money expressions\. For each numeric mention, it stores a local token window and derives a lightweight node annotation consisting of the surface number, normalized value, unit phrase, entity mention, and local predicate context\.
Unit phrases are extracted from nearby content words following or preceding the number, after removing stopwords and arithmetic function words\. Entity mentions are approximated by nearby noun\-like tokens and named entities in the same local window\. Predicate context is represented by nearby verbs and comparative markers such as `more`, `fewer`, `left`, `total`, `each`, `per`, `gave`, `bought`, `removed`, and `remaining`\. These annotations are diagnostic features rather than a full semantic parse\.
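To make the node annotation concrete, the sketch below extracts numeric mentions with a local token window, a unit guess, and predicate markers. Everything here is a simplified assumption: the tiny number\-word table, the stopword list, the window radius, and the rule "first following content word is the unit" are illustrative, and entity detection is omitted entirely.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Small illustrative vocabularies; the real checker's lists are not given in the text.
NUMBER_WORDS = {"two": 2, "three": 3, "four": 4, "five": 5, "ten": 10}
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "with", "for", "has", "had"}
PREDICATES = {"more", "fewer", "left", "total", "each", "per",
              "gave", "bought", "removed", "remaining"}
TOKEN = re.compile(r"\$?\d[\d,]*(?:\.\d+)?|[A-Za-z]+")


@dataclass
class QuantityNode:
    surface: str             # number as written
    value: float             # normalized value
    unit: Optional[str]      # first following content word, if any
    predicates: List[str]    # predicate or comparative markers in the window
    window: List[str]        # local token window


def extract_quantities(text, radius=3):
    tokens = TOKEN.findall(text)
    nodes = []
    for i, tok in enumerate(tokens):
        low = tok.lower()
        if low in NUMBER_WORDS:
            value = float(NUMBER_WORDS[low])
        elif tok[0].isdigit() or tok[0] == "$":
            value = float(tok.lstrip("$").replace(",", ""))
        else:
            continue
        window = [t.lower() for t in tokens[max(0, i - radius): i + radius + 1]]
        following = [t.lower() for t in tokens[i + 1: i + 1 + radius] if t.isalpha()]
        content = [t for t in following if t not in STOPWORDS and t not in PREDICATES]
        nodes.append(QuantityNode(
            surface=tok,
            value=value,
            unit=content[0] if content else None,
            predicates=[t for t in window if t in PREDICATES],
            window=window,
        ))
    return nodes


nodes = extract_quantities("Tom bought 3 bags with 4 candies each and gave 2 to Ann.")
for n in nodes:
    print(n.surface, n.unit, n.predicates)
```

Such window heuristics will mislabel some units (here, the quantity `2` picks up a name as its "unit"), which is consistent with the text's framing of these annotations as diagnostic features rather than a semantic parse.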
Edges are added by deterministic relation templates\. Aggregation edges are triggered by markers such as `total`, `together`, `in all`, or additive equations\. Comparison edges are triggered by `more than`, `fewer than`, `less than`, and related comparative forms\. Rate edges are triggered by `each`, `per`, `every`, and multiplicative contexts\. Change\-event edges are triggered by verbs such as `gave`, `lost`, `spent`, `removed`, `bought`, `received`, or `added`\. Part\-whole edges are triggered when a total quantity is paired with subgroup quantities\.
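The relation templates can be approximated with keyword patterns, as in the sketch below. The regular expressions are assumptions built from the marker lists above; part\-whole edges, additive equations, and multiplicative contexts are left out because the text does not specify how they are detected.

```python
import re
from typing import Set

# One pattern per edge type, using the surface markers listed in the text.
EDGE_TEMPLATES = {
    "aggregation": r"\b(?:total|together|in all)\b",
    # Allow a few intervening words, e.g. "5 more apples than Bob".
    "comparison": r"\b(?:more|fewer|less)\b(?:\s+\w+){0,3}?\s+than\b",
    "rate": r"\b(?:each|per|every)\b",
    "change_event": r"\b(?:gave|lost|spent|removed|bought|received|added)\b",
}


def detect_edge_types(sentence: str) -> Set[str]:
    """Return the edge types whose surface template fires on this sentence."""
    low = sentence.lower()
    return {name for name, pattern in EDGE_TEMPLATES.items() if re.search(pattern, low)}


print(detect_edge_types("She spent 3 dollars per day."))
print(detect_edge_types("Ann has 5 more apples than Bob."))
```

Because the templates fire on surface cues alone, they can over\-trigger; the later sections on diagnostic logs and the manual audit describe how such warnings are filtered downstream.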
The graph score is a heuristic risk score used only for triggering and guarded acceptance\. Starting from 1\.0, the checker subtracts penalties for high\-risk mismatches, including quantity\-binding conflicts, reversed or ignored comparison relations, missing rate multiplication, change\-event direction errors, answer\-format mismatches, and generation failures\. The final score is clipped to $[0,1]$\. The graph guard accepts a candidate only if it avoids generation failure, has no high\-risk semantic issue, and its score is at least the configured minimum threshold\. In the main experiments, this threshold is 0\.60\. This value is a conservative engineering threshold and is not tuned on test labels; we evaluate its downstream role through the semantic\-graph ablation and the manual risk\-signal audit\.
```python
def semantic_graph_check(problem, trace):
    # Extract numeric mentions from the problem text and the reasoning trace.
    q_problem = extract_quantities(problem)
    q_trace = extract_quantities(trace)
    # Build surface relation graphs over the extracted quantities.
    g_problem = build_relation_graph(q_problem, problem)
    g_trace = build_relation_graph(q_trace, trace)
    # Collect risk signals from each deterministic check.
    risks = []
    risks += check_quantity_binding(g_problem, g_trace)
    risks += check_comparisons(g_problem, g_trace)
    risks += check_rate_usage(g_problem, g_trace)
    risks += check_change_events(g_problem, g_trace)
    risks += check_answer_format(problem, trace)
    # Start from 1.0, subtract a penalty per risk, and clip to [0, 1].
    score = 1.0
    for risk in risks:
        score -= penalty(risk)
    score = clip(score, 0.0, 1.0)
    return score, risks
```
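The pseudocode above leaves the penalty function and the guard decision abstract. The sketch below fills them in under loud assumptions: the penalty values are invented for illustration (the paper does not list them here), and the high\-risk set is borrowed from the trigger policy in Section B\.8. Only the 0\.60 threshold and the three guard conditions come from the text.

```python
# Hypothetical penalties; the actual values are not given in this appendix.
PENALTY = {
    "generation_failure": 1.00,
    "quantity_binding": 0.15,
    "comparison_warning": 0.15,
    "answer_format_warning": 0.10,
}
DEFAULT_HIGH_RISK_PENALTY = 0.50  # hypothetical

# Assumption: reuse the high-risk categories named by the trigger policy.
HIGH_RISK = {
    "times_more_interpretation",
    "per_entity_rate_missing",
    "equally_split_interpretation",
    "change_event_misinterpretation",
}
MIN_GRAPH_SCORE = 0.60  # threshold reported for the main experiments


def graph_score(risks):
    """Start from 1.0, subtract one penalty per risk, and clip to [0, 1]."""
    total = sum(PENALTY.get(r, DEFAULT_HIGH_RISK_PENALTY) for r in risks)
    return min(1.0, max(0.0, 1.0 - total))


def passes_graph_guard(risks):
    """Accept only without generation failure, without high-risk issues,
    and with a score at or above the minimum threshold."""
    if "generation_failure" in risks:
        return False
    if any(r in HIGH_RISK for r in risks):
        return False
    return graph_score(risks) >= MIN_GRAPH_SCORE


print(passes_graph_guard(["quantity_binding"]))
print(passes_graph_guard(["per_entity_rate_missing"]))
```

Keeping the categorical vetoes separate from the numeric score means a single high\-risk issue blocks acceptance regardless of how the penalties are weighted.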
### B\.5 Surface Semantic\-Risk Taxonomy
Table [13](https://arxiv.org/html/2605.24613#A2.T13) summarizes the main risk categories of the surface semantic\-risk graph and their trigger conditions\. The checker emits risk categories rather than correctness labels; these categories are used as trigger signals and as part of guarded candidate acceptance\.

| Risk type | Trigger condition | Example failure |
| --- | --- | --- |
| `quantity_binding` | Same number used with different nearby entity/unit | Apples assigned to oranges |
| `comparison_warning` | Comparative marker is reversed, ignored, or unsupported | Uses 5 instead of base + 5 |
| `per_entity_rate_missing` | `each`/`per` quantity lacks entity multiplier | 3 bags × 4 candies treated as 4 |
| `change_event_misinterpretation` | Add/remove/spend/give event has wrong direction | Had 10, gave away 3 → uses 13 |
| `answer_format_warning` | Requested value type mismatches final answer type | Outputs total instead of difference |

Table 13: Surface semantic\-risk graph categories\. The checker uses deterministic surface\-pattern rules and emits heuristic risk signals rather than proof\-level correctness labels\.
### B\.6 Surface Semantic\-Risk Graph Diagnostic Logs
The surface semantic\-risk graph checker is not evaluated as a standalone semantic parser\. Instead, we inspect the diagnostic\-log patterns produced during the main GSM8K run to understand how graph signals participate in triggering and acceptance decisions\. These logs summarize acceptance\-decision patterns rather than all individual model calls, so the counts below should be interpreted as implementation diagnostics, not as semantic\-parser precision or recall\.
Table 14: Surface semantic\-risk graph diagnostic\-log summary on the GSM8K main run\. The table summarizes acceptance\-decision patterns rather than a standalone semantic\-parser evaluation\. No\-op candidates are rejected when the final answer does not change\.
Table 15: Initial semantic graph risk types observed in the diagnostic\-log patterns\. The distribution shows that the graph checker primarily detects quantity\-binding risks, while also covering comparison, rate, change\-event, and answer\-format warnings\.
The log summary supports the interpretation of the surface semantic\-risk graph checker as a downstream risk\-control module\. In the inspected patterns, all accepted answer\-changing candidates are graph\-clean, while graph\-risk patterns are dominated by quantity\-binding errors and include comparison, rate, change\-event, and answer\-format warnings\. This evidence complements the ablation in Table [2](https://arxiv.org/html/2605.24613#S5.T2): removing the graph gate does not catastrophically increase measured harm in that run, but it reduces fixed errors and final accuracy\. The rejected unsafe repair in Table [9](https://arxiv.org/html/2605.24613#A1.T9) illustrates the intended safety role: a candidate would change an initially correct answer, but residual graph/format warnings cause the system to preserve the cached trace\.
### B\.7 Manual Audit of Surface Graph Risk Signals
To check whether surface semantic\-risk graph warnings correspond to human\-identifiable risks, we manually audit 50 stratified non\-none initial graph\-risk cases from the main GSM8K run\. The sample is stratified over emitted risk types, so it is intended as a qualitative audit of risk coverage rather than an estimate of parser precision or end\-to\-end guard precision\. The audit labels each case as a clear semantic risk, a conservative/benign but relevant warning, or no identifiable risk\. Table [16](https://arxiv.org/html/2605.24613#A2.T16) summarizes the results\.
Table 16: Manual audit of 50 stratified non\-none surface graph risk cases from the main GSM8K run\. The audit checks whether emitted graph\-risk signals correspond to identifiable semantic\-risk patterns; it is not a standalone semantic\-parser evaluation or a high\-precision parser benchmark\.
The audit covers quantity\-binding warnings \(25 cases\), comparison warnings \(12\), combined quantity\-binding/comparison warnings \(7\), answer\-format warnings \(3\), and one case each of split interpretation, change\-event interpretation, and per\-entity rate omission\. All audited warnings correspond to at least a conservative semantic\-risk pattern, but most are benign warnings on traces that ultimately remain correct\. This supports using the graph checker as an auditable risk\-control signal while reinforcing that it should not be interpreted as a high\-precision standalone parser\. In particular, the 88% conservative/benign\-warning rate indicates that the graph feature is deliberately recall\-oriented: it is useful for preventing unsafe replacement and shaping trigger decisions, but its warnings should be filtered by the downstream acceptance policy rather than treated as correctness judgments\.
### B\.8 Repair Trigger Policy
The medium\-strength repair trigger is designed to avoid repairing every sample\. Repair is attempted when any of the following conditions holds:
- the initial reasoning is empty;
- the meta\-diagnosis is `generation_failure`, `arithmetic_error`, or `logical_contradiction`;
- the semantic graph diagnosis is `generation_failure`;
- a high\-risk semantic issue is detected, including `times_more_interpretation`, `per_entity_rate_missing`, `equally_split_interpretation`, or `change_event_misinterpretation`;
- the meta\-diagnosis is `missing_constraint` and the global meta score is below 0\.90;
- the global meta score is below 0\.65\.
Otherwise, the original reasoning is preserved without calling the repair model\.
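```python
def trigger(meta, graph, trace):
    # Empty initial reasoning always triggers repair.
    if empty(trace):
        return True
    # Hard meta-diagnosis errors.
    if meta.err in {gen_fail, arith_err, logic_err}:
        return True
    # Graph-side generation failure or any high-risk semantic issue.
    if graph.err == gen_fail:
        return True
    if high_risk_graph(graph):
        return True
    # missing_constraint uses the stricter 0.90 score threshold.
    if meta.err == missing_constraint:
        return meta.score < 0.90
    # Otherwise trigger only on a low global meta score.
    return meta.score < 0.65
```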
### B\.9 Best\-of\-$N$ Repair Selection
For each triggered sample, the system generates up to $N$ repair candidates\. The final experiments use $N=3$\. Each candidate is converted from JSON into the standardized trace format and is then checked for output cleanliness, symbolic consistency, constraint coverage, meta\-consistency, semantic graph quality, and equation support\. The system selects the first candidate that passes both the semantic graph guard and the deterministic guarded acceptance policy\. If no candidate is accepted, the final reasoning remains the original initial reasoning\.
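```python
# The wrapper name is illustrative; the helpers are defined elsewhere in the pipeline.
def best_of_n_repair(x, r0, diag0, graph0, N=3):
    for i in range(N):
        cand = repair_llm(x, r0, diag0, i)  # attempt i uses prompt strategy i
        if not clean(cand):
            continue  # skip malformed or verbose candidates
        diag_c = diagnose(x, cand)
        graph_ok = graph_guard(graph0, diag_c.graph)
        accept = accept_policy(r0, cand, diag0, diag_c)
        if graph_ok and accept:
            return cand  # first candidate that passes both guards
    return r0  # no candidate accepted: preserve the cached trace
```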
### B\.10 Prompt Strategies
The three repair attempts use the same JSON schema but different attempt strategies:
- Attempt 0 uses the deterministic diagnostic hint as guidance when helpful and changes the answer only if the previous answer is not supported by the problem\.
- Attempt 1 prioritizes strict formatting and concise arithmetic, and instructs the model to preserve the original answer if it is defensible\.
- Attempt 2 solves from the original problem in at most four compact steps, treating the initial reasoning only as a warning signal\.
All attempts require a JSON object with a `steps` list and a number\-only `final_answer`\. The format retry uses the same schema and asks the model to rewrite only the malformed output, without prose outside JSON\.
The repair prompt follows this skeleton:
```text
System: Return only valid JSON. No markdown.
Schema:
{
  "steps": ["short arithmetic step"],
  "final_answer": "number"
}
Rules:
- Use at most 4 steps.
- Prefer arithmetic equations.
- final_answer is number-only.
- Use but do not mention the hint.
- Attempt style: <style>
Problem: <problem>
Initial: <initial reasoning>
Hint: <diagnostic hint>
Semantic error: <semantic error>
Meta error: <meta error>
```
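A repair response in this schema still has to be validated and converted into a trace before the guards run. The sketch below shows one way to do that; the `Step i:` layout of the "standardized trace format" and the function name `json_to_trace` are assumptions, since the appendix does not spell out the format.

```python
import json
import re

MAX_STEPS = 4  # matches the "at most 4 steps" rule in the prompt
NUMBER_ONLY = re.compile(r"-?\d+(?:\.\d+)?(?:/\d+)?")


def json_to_trace(raw):
    """Validate a repair response and render it as a trace, or return None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None                      # not valid JSON: candidate is unusable
    if not isinstance(obj, dict):
        return None
    steps = obj.get("steps")
    if not isinstance(steps, list) or not steps or len(steps) > MAX_STEPS:
        return None
    if not all(isinstance(s, str) for s in steps):
        return None
    final = str(obj.get("final_answer")).strip()
    if not NUMBER_ONLY.fullmatch(final):
        return None                      # final_answer must be number-only
    lines = [f"Step {i}: {s}" for i, s in enumerate(steps, 1)]
    lines.append(f"Final Answer: {final}")  # exactly one Final Answer line
    return "\n".join(lines)


print(json_to_trace('{"steps": ["276 / 12 = 23"], "final_answer": "23"}'))
```

Rendering a single `Final Answer` line by construction makes the later cleanliness check on "exactly one `Final Answer` line" easy to satisfy for well\-formed responses.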
### B\.11 Candidate Cleanliness
A candidate is considered clean only if it is non\-empty, contains a parseable final answer, includes exactly one `Final Answer` line, and is not excessively long\. The implementation also rejects candidates containing meta\-discussion phrases such as `previous reasoning`, `the diagnosis says`, `provided hint`, `this is ambiguous`, or references to the prompt itself\. This prevents verbose self\-analysis from being accepted as a repaired mathematical solution\.
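The cleanliness conditions translate into a short predicate. In the sketch below, the length limit `MAX_CHARS` is a hypothetical value (the text only says "not excessively long"), and the banned\-phrase list contains just the four examples quoted above.

```python
import re

BANNED_PHRASES = (
    "previous reasoning",
    "the diagnosis says",
    "provided hint",
    "this is ambiguous",
)
MAX_CHARS = 1200  # hypothetical limit; the actual bound is not stated

# A line that starts with "Final Answer:" followed by some value.
FINAL_LINE = re.compile(r"^\s*Final Answer:\s*(\S.*)$", re.IGNORECASE | re.MULTILINE)


def is_clean(candidate):
    """Apply the cleanliness conditions listed in the text."""
    text = candidate.strip()
    if not text or len(text) > MAX_CHARS:
        return False                                  # empty or excessively long
    finals = FINAL_LINE.findall(text)
    if len(finals) != 1:
        return False                                  # exactly one Final Answer line
    if not re.search(r"\d|\b(?:yes|no)\b", finals[0], re.IGNORECASE):
        return False                                  # final answer must be parseable
    low = text.lower()
    return not any(phrase in low for phrase in BANNED_PHRASES)


print(is_clean("276 / 12 = 23\nFinal Answer: 23"))
print(is_clean("The previous reasoning was wrong.\nFinal Answer: 23"))
```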
### B\.12 Equation\-Support Guard
The equation\-support guard checks whether the candidate final answer is explicitly supported by the candidate reasoning\. It recognizes:
- standard arithmetic equations, e\.g\., `276/12=23`;
- LCM/GCD equations, e\.g\., `LCM(6,5)=30`;
- restricted derivational statements such as `Number of trays = 23` or `The greatest common divisor is 15`\.
The guard intentionally does not accept arbitrary copied quantities, such as `Time saved = 64`, because such statements may simply repeat a number from the problem without deriving the answer\. The ablation without this guard introduces two broken\-correct cases on GSM8K, supporting its role as a safety condition\.
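A sketch of the first two recognized forms follows. It goes one step beyond what the text states by also verifying that the left\-hand side evaluates to the right\-hand side; that check, the regular expressions, and the function name `is_supported` are assumptions. Restricted derivational statements are omitted because their accepted patterns are not specified, and symbols such as `×` are not handled.

```python
import ast
import math
import operator
import re
from fractions import Fraction

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def _eval(node):
    """Evaluate a +, -, *, / expression tree exactly, rejecting anything else."""
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return Fraction(str(node.value))
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("unsupported expression")


_NUM = r"\d+(?:\.\d+)?"
# An expression needs at least one operator, so "64 = 64" never counts as support.
_EXPR = rf"\(?\s*{_NUM}(?:\s*[+\-*/]\s*\(?\s*{_NUM}\s*\)?)+"
_EQUATION = re.compile(rf"(?<![\d.])({_EXPR})\s*=\s*(-?{_NUM})")
_LCM_GCD = re.compile(r"\b(LCM|GCD)\(\s*(\d+)\s*,\s*(\d+)\s*\)\s*=\s*(\d+)", re.IGNORECASE)


def is_supported(reasoning, final_answer):
    """True if some verified equation in the reasoning yields the final answer."""
    target = Fraction(final_answer)
    for m in _LCM_GCD.finditer(reasoning):
        a, b, rhs = int(m.group(2)), int(m.group(3)), int(m.group(4))
        if a == 0 or b == 0:
            continue
        g = math.gcd(a, b)
        value = g if m.group(1).upper() == "GCD" else a * b // g
        if value == rhs == target:
            return True
    for m in _EQUATION.finditer(reasoning):
        try:
            lhs = _eval(ast.parse(m.group(1).strip(), mode="eval").body)
        except (ValueError, SyntaxError, ZeroDivisionError):
            continue
        if lhs == Fraction(m.group(2)) == target:
            return True
    return False


print(is_supported("276/12 = 23", "23"), is_supported("Time saved = 64", "64"))
```

Walking the parsed expression tree with a whitelist of operators avoids calling `eval` on model output, which matters because the reasoning text is untrusted.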
### B\.13 Accepted Repair Outcome Accounting
Accepted repair precision is defined as
$$\frac{\#(\text{wrong}\rightarrow\text{correct accepted repairs})}{\#(\text{accepted repairs})}. \tag{4}$$
This is a conservative measure: accepted repairs that do not fix an initially wrong example lower precision even if they do not create harm\. Table [17](https://arxiv.org/html/2605.24613#A2.T17) decomposes accepted repairs by before/after correctness transition\.
Table 17: Accepted repair outcomes\. W→C repairs are fixed errors; C→W repairs are broken\-correct cases; W→W repairs are accepted but unsuccessful modifications; C→C repairs are unnecessary but non\-harmful modifications\.
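As a worked instance of Eq\. \(4\), the snippet below plugs in counts quoted elsewhere in the paper: 17 fixed errors and 20 accepted repairs for the GSM8K main run, and 92 fixed with 95 accepted for the seed\-42 ASDiv run\. It assumes that every fixed error corresponds to one accepted wrong\-to\-correct repair; Tables 6 and 17 are not reproduced in this section, so the values printed here are a derived illustration and the tables take precedence if they differ.

```python
def accepted_precision(n_fixed, n_accepted):
    """Eq. (4): wrong-to-correct accepted repairs over all accepted repairs."""
    if n_accepted == 0:
        return float("nan")  # undefined when nothing is accepted
    return n_fixed / n_accepted


gsm8k_precision = accepted_precision(17, 20)   # counts from the abstract and Section A.7
asdiv_precision = accepted_precision(92, 95)   # counts from the seed-42 ASDiv run

print(f"GSM8K: {gsm8k_precision:.2%}")
print(f"ASDiv: {asdiv_precision:.2%}")
```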
### B\.14 ASDiv Multi\-Seed Subset Check
The main ASDiv experiment uses a 1,000\-example subset to control repair\-stage cost\. Because the filtered numeric pool contains 2,147 examples, sampling can introduce variance\. We therefore run three additional uniformly sampled 1,000\-example subsets using seeds 0, 1, and 2\. These subsets are sampled from the same filtered numeric pool before repair is run, and no seed is selected based on repair performance\. All runs use the same weak initial reasoner, repair model, and guarded best\-of\-3 configuration\. Table [18](https://arxiv.org/html/2605.24613#A2.T18) reports the original seed\-42 run together with seeds 0, 1, and 2\.

| Seed | Init\. | Final | Δ | Fixed | Broken | Acc\. |
| --- | --- | --- | --- | --- | --- | --- |
| 42 | 78\.40 | 87\.60 | \+9\.20 | 92 | 0 | 95 |
| 0 | 79\.50 | 88\.00 | \+8\.50 | 87 | 2 | 91 |
| 1 | 79\.90 | 88\.60 | \+8\.70 | 91 | 4 | 97 |
| 2 | 78\.80 | 88\.60 | \+9\.80 | 104 | 6 | 113 |
| Mean | 79\.15 | 88\.20 | \+9\.05 | 93\.50 | 3\.00 | 99\.00 |
| Std\. | 0\.59 | 0\.42 | 0\.50 | 6\.34 | 2\.24 | 8\.37 |

Table 18: ASDiv multi\-seed subset check\. Each row uses a uniformly sampled 1,000\-example subset from the same 2,147\-example numeric pool\. All seeds show consistent positive gains, while additional seeds reveal rare broken\-correct cases\. “Acc\.” denotes accepted repairs\.
Across four seeds, final accuracy ranges from 87\.60% to 88\.60%, and absolute improvement ranges from \+8\.50 to \+9\.80 percentage points, with mean improvement \+9\.05 and standard deviation 0\.50\. Thus, the ASDiv gain is not unique to seed 42\. However, the additional seeds also introduce rare broken\-correct cases, so the seed\-42 zero\-harm observation should not be interpreted as a universal property of the weak\-reasoner setting\. Since only four sampled subsets are evaluated and the subsets may overlap, we treat the mean, standard deviation, and range as a sampling\-sensitivity check rather than a definitive confidence interval for the full 2,147\-example numeric pool\.
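The summary rows can be recomputed from the four per\-seed rows. The check below uses Python's `statistics` module; recomputing shows that the reported standard deviations match the population form (`pstdev`, dividing by 4), whereas the sample form (`stdev`, dividing by 3) would give about 0\.58 for the accuracy gain. Variable names are illustrative.

```python
from statistics import mean, pstdev, stdev

# seed -> (initial acc %, final acc %, fixed, broken, accepted), from Table 18
rows = {
    42: (78.40, 87.60, 92, 0, 95),
    0: (79.50, 88.00, 87, 2, 91),
    1: (79.90, 88.60, 91, 4, 97),
    2: (78.80, 88.60, 104, 6, 113),
}

deltas = [round(final - init, 2) for init, final, *_ in rows.values()]
fixed = [r[2] for r in rows.values()]
broken = [r[3] for r in rows.values()]

# With 1,000 examples per subset, the gain in points equals (fixed - broken) / 10.
for d, f, b in zip(deltas, fixed, broken):
    assert abs(d - (f - b) / 10) < 1e-9

print(f"mean gain  = {mean(deltas):.2f}")
print(f"pstdev     = {pstdev(deltas):.2f}  (population form, matches the table)")
print(f"stdev      = {stdev(deltas):.2f}  (sample form, for comparison)")
print(f"fixed  mean/pstdev = {mean(fixed):.2f} / {pstdev(fixed):.2f}")
print(f"broken mean/pstdev = {mean(broken):.2f} / {pstdev(broken):.2f}")
```

The internal consistency check (gain equals fixed minus broken, scaled by the subset size) also confirms that the flattened table columns are read in the right order.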
### B\.15 Repair\-Stage Stability Reruns
To assess repair\-stage variability, we rerun the GSM8K repair stage three additional times using the same cached initial traces and the same guarded best\-of\-3 configuration\. The initial traces are fixed, so variation comes from API\-side repair generation and formatting/retry behavior\. Table [19](https://arxiv.org/html/2605.24613#A2.T19) reports the original main run together with the three reruns\.
Table 19: Repair\-stage stability on GSM8K using fixed cached initial traces\. All reruns improve over the 95\.60% initial accuracy, but rare broken\-correct cases appear under API\-side variation\.
### B\.16 Acceptance Policy
The acceptance policy rejects no\-op repairs whose final answer does not change\. It then considers several acceptance paths, including high\-risk semantic repair, empty\-generation rescue, very\-low\-confidence rescue, restricted clean semantic improvement, relaxed support acceptance, and weak\-reasoner relaxed acceptance\. The strong\-reasoner GSM8K relaxed path accepts meta\-clean or `low_symbolic_coverage` candidates only when they are graph\-clean, sufficiently high\-scoring, and equation\-supported\. The main GSM8K configuration does not use the relaxed `missing_constraint` path, because this category may indicate a genuinely omitted quantity\. In the ablation, relaxing `missing_constraint` produces the same final accuracy but accepts more modifications without fixing more errors\.
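Of the acceptance paths listed above, only the strong\-reasoner relaxed path is described in enough detail to sketch. The version below is a simplified illustration: the dataclass fields, the label `"none"` for a meta\-clean candidate, and the score threshold `MIN_RELAXED_SCORE` are assumptions (the actual thresholds are given in Table 20, which is not reproduced here), and the other paths are not modeled.

```python
from dataclasses import dataclass


@dataclass
class CandidateDiag:
    """Illustrative summary of a candidate's diagnostics (field names assumed)."""
    final_answer: str
    meta_error: str           # e.g. "none", "low_symbolic_coverage", "missing_constraint"
    meta_score: float
    graph_clean: bool
    equation_supported: bool


MIN_RELAXED_SCORE = 0.85      # hypothetical threshold


def relaxed_accept(original_answer, cand, allow_missing_constraint=False):
    """Strong-reasoner relaxed path: all listed conditions must hold."""
    if cand.final_answer == original_answer:
        return False          # no-op repairs are always rejected
    allowed = {"none", "low_symbolic_coverage"}
    if allow_missing_constraint:
        allowed.add("missing_constraint")   # ablation-only relaxation
    return (
        cand.meta_error in allowed
        and cand.graph_clean
        and cand.meta_score >= MIN_RELAXED_SCORE
        and cand.equation_supported
    )


good = CandidateDiag("23", "none", 0.95, True, True)
risky = CandidateDiag("23", "missing_constraint", 0.95, True, True)
print(relaxed_accept("24", good), relaxed_accept("24", risky))
```

Requiring every condition jointly, rather than a weighted vote, mirrors the conservative stance of the policy: any single missing safeguard leaves the cached trace in place.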
### B\.17 Thresholds
Table [20](https://arxiv.org/html/2605.24613#A2.T20) summarizes the main thresholds used by the final pipeline\.
Table 20: Main thresholds and decoding settings\.
### B\.18 Ablation Configurations
The GSM8K ablations are implemented by environment variables\. `LLM_REPAIR_NUM_CANDIDATES` controls $N$\. `ENABLE_GRAPH_GUARD=false` disables the semantic graph guard\. `DISABLE_EQUATION_SUPPORT_GUARD=true` disables the equation\-support requirement\. `RELAX_MISSING_CONSTRAINT_ACCEPT=true` enables an ablation\-only acceptance path for `missing_constraint` candidates\. These ablations are not used in the main configuration unless explicitly stated\.
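One way to read these switches into a configuration object is sketched below. The four variable names come from the text; the defaults (three candidates, both guards on, no relaxed path), the accepted truthy spellings, and the class and function names are assumptions that mirror the main configuration as described.

```python
import os
from dataclasses import dataclass

_TRUTHY = {"1", "true", "yes"}


def _flag(env, name, default):
    """Read a boolean environment variable, falling back to a default."""
    return env.get(name, str(default)).strip().lower() in _TRUTHY


@dataclass(frozen=True)
class AblationConfig:
    num_candidates: int
    graph_guard: bool
    equation_support_guard: bool
    relax_missing_constraint: bool


def load_config(env=None):
    """Build the configuration from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return AblationConfig(
        num_candidates=int(env.get("LLM_REPAIR_NUM_CANDIDATES", "3")),
        graph_guard=_flag(env, "ENABLE_GRAPH_GUARD", True),
        # Note the inverted sense: this variable disables the guard.
        equation_support_guard=not _flag(env, "DISABLE_EQUATION_SUPPORT_GUARD", False),
        relax_missing_constraint=_flag(env, "RELAX_MISSING_CONSTRAINT_ACCEPT", False),
    )


main_cfg = load_config({})
no_graph_cfg = load_config({"ENABLE_GRAPH_GUARD": "false", "LLM_REPAIR_NUM_CANDIDATES": "1"})
print(main_cfg)
print(no_graph_cfg)
```

Passing the environment as an explicit mapping keeps ablation runs reproducible in tests without mutating the process environment.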
The direct strong\-model baselines are implemented separately from the main repair pipeline\. `solve_all` re\-solves every GSM8K example with the strong repair model and accepts all outputs\. `solve_triggered` re\-solves only examples triggered by the main GuardedRepair run and accepts all outputs\. `direct_bestof3_gated` uses the same triggered set, strong model, $N=3$ budget, and acceptance gates as GuardedRepair, but removes the initial trace and diagnostic hint from the repair prompt\.