# Less Is More: Cognitive Load and the Single-Prompt Ceiling in LLM Mathematical Reasoning
Source: [https://arxiv.org/html/2604.18897](https://arxiv.org/html/2604.18897)
Manuel Israel Cázares, Bytepro AI, Mazatlán, Sinaloa, Mexico
hello@bytepro.ai | israel.cazares@gmail.com
(April 2026)
###### Abstract
We present a systematic empirical study of prompt engineering for formal mathematical reasoning in the context of the SAIR Equational Theories Stage 1 competition. The task requires deciding whether one equational law implies another over all magmas (algebraic structures with a single binary operation) — a problem that is undecidable in general but decidable for False via finite model search. Over five weeks, we designed, tested, and analyzed more than 40 prompt variants, ranging from 0 to 4,878 bytes, across four evaluation splits and three language models (gpt-oss-120b, Llama 3.3 70B, Gemma 4 31B).
Our central finding is a *single-prompt ceiling* — or more precisely, an *empirical saturation region*: a zone where accuracy improvements become unstable and non-generalizable across problem distributions, rather than an absolute theoretical limit. Despite substantial engineering effort, balanced hard accuracy plateaus at approximately 60–79% for gpt-oss-120b, compared to a 59.75% no-cheatsheet baseline (95% CI: [54.9%, 64.4%]). We identify three mechanisms underlying this saturation region: (1) the mathematical undecidability of the True case limits what any finite prompt can encode; (2) complex rule systems decrease performance on weaker models (Llama 3.3 70B collapses to 0% True recall with prompts exceeding ≈2 KB); and (3) prompt ordering effects interact with model attention in fragile, non-monotonic ways.
We also document a *distribution mismatch* failure mode: a rule that appeared correct when validated on a False-heavy subset (hard1, 35% True) catastrophically failed on a balanced subset (hard2, 50% True), incorrectly blocking 51% of True problems. Our best submission (AN45c, 2,252 bytes) achieves 79.25% accuracy on hard3 ($n=400$; 95% CI: [75.0%, 82.9%]), with True recall of 95.9% and False recall of 63.4%, representing a +19.5 percentage-point improvement over the no-cheatsheet baseline (59.75%; 95% CI: [54.9%, 64.4%]). A cross-provider validation run on OpenRouter/DeepInfra bf16 ($n=20$) yielded 90–95%, consistent with the full-scale result. The key design decision — placing the trivial-magma check before the counterexample table — accounts for the primary performance gain over its predecessor AN38 (71.8%), not the addition of new content. We release all prompt variants, evaluation scripts, and results tables.
Post-submission validation against the SAIR official benchmark reveals a cross-distribution trade-off surface: gains within the saturation region are distribution-fragile, with our most engineered variant (AN45c) falling below the no-cheatsheet baseline on the official evaluation while the simpler predecessor (AN38) produces a robust improvement. See Section [9](https://arxiv.org/html/2604.18897#S9) for the full analysis.
Note: This is a pre-competition-leaderboard version based on Contributor Network data ($n=52$ voluntary submissions at competition close, April 20, 2026). An updated analysis incorporating full competition results ($n=1{,}007$) will follow after April 30, 2026.
## 1 Introduction
Large language models have demonstrated surprising competence on mathematical reasoning tasks, yet their behavior on problems requiring formal logical completeness — where a single counterexample suffices to disprove a claim — remains poorly understood. The SAIR Equational Theories Stage 1 competition [[13](https://arxiv.org/html/2604.18897#bib.bib3)] provides an unusually clean testbed for studying this question: given two equations over magmas (sets with a single binary operation), decide whether the first implies the second universally. The problem is computationally asymmetric: False instances can in principle be certified by exhibiting a small finite counterexample, while True instances require a universal proof that no counterexample exists, including among infinite algebraic structures.
Unlike prompt engineering for standard benchmarks such as GSM8K [[3](https://arxiv.org/html/2604.18897#bib.bib6)], this task involves semi-decidable algebraic implication where False is certifiable via finite counterexample but True requires universal quantification over all magmas — a fundamentally different reasoning regime.
This asymmetry creates a natural design space for prompt engineering. One might expect that providing a language model with a library of known counterexamples would systematically improve False accuracy, while a brief instruction about singleton-forcing equations would improve True accuracy. Our experiments reveal a more complex picture.
Over five weeks of systematic experimentation prior to the April 20, 2026 competition deadline, we tested more than 40 prompt variants on over 1,000 labeled problems across four dataset splits. We found that prompt complexity and multi-model generalization are inversely related: the prompts that most improved gpt-oss-120b’s performance on False-heavy subsets (e.g., AN3c’s 4,306-byte Block system achieving 78.3% on hard1) performed at or below baseline on balanced subsets and collapsed to near-zero True recall on Llama 3.3 70B. Conversely, the most compact effective prompt (AN19c, 289 bytes) performed within 2 percentage points of complex variants on gpt-oss-120b while being the only variant that maintained meaningful True recall on Llama.
The phenomenon we document — call it the *single-prompt ceiling* — manifests as a bound on what a static text prompt can accomplish when the underlying task requires mathematical reasoning that the base model has not internalized. We provide empirical evidence that this ceiling lies at approximately 60–79% balanced hard accuracy for gpt-oss-120b, a model that achieves only 26.5% False recall with default reasoning and no cheatsheet on the official benchmark (our own controlled baseline measurement yields 38.0% False recall under Together AI inference; see Section [5](https://arxiv.org/html/2604.18897#S5)). A prompt cannot teach a model the mathematics it does not know; it can only guide it to apply what it does know more reliably.
We make the following contributions:
1. Systematic ablation study: 40+ prompt variants tested on labeled splits, enabling controlled analysis of which design choices matter.
2. Distribution mismatch failure mode: Quantified how validation on False-heavy data produces incorrect conclusions when the target distribution is balanced.
3. Multi-model generalization analysis: First systematic study of prompt portability across gpt-oss-120b, Llama 3.3 70B, and Gemma 4 31B on this task.
4. Ordering effect: Evidence that trivial-magma-first ordering (AN45c) outperforms CE-table-first ordering (AN38) by +7.5 percentage points at full scale ($n=400$), with non-overlapping 95% Wilson CIs ([75.0%, 82.9%] vs. [67.1%, 75.9%]) (local evaluation; see Section [9](https://arxiv.org/html/2604.18897#S9) for official benchmark divergence).
5. Practical guidelines: Minimal effective prompts for multi-model deployment, with analysis of why simpler beats complex.
6. Cross-distribution trade-off surface: Official benchmark validation reveals that local gains within the saturation region do not transfer across problem distributions (Section [9](https://arxiv.org/html/2604.18897#S9)).
The rest of the paper is organized as follows. Section [2](https://arxiv.org/html/2604.18897#S2) provides background on magmas, equational implication, and the Equational Theories Project. Section [3](https://arxiv.org/html/2604.18897#S3) reviews related work. Section [4](https://arxiv.org/html/2604.18897#S4) describes our methodology. Sections [5](https://arxiv.org/html/2604.18897#S5)–[7](https://arxiv.org/html/2604.18897#S7) present results, analysis, and multi-model generalization findings. Section [8](https://arxiv.org/html/2604.18897#S8) characterizes the single-prompt ceiling theoretically. Section [9](https://arxiv.org/html/2604.18897#S9) presents post-submission validation against the official benchmark and cross-distribution analysis of the Contributor Network leaderboard. Section [10](https://arxiv.org/html/2604.18897#S10) concludes.
## 2 Background
### 2.1 Magmas and Equational Laws
A *magma* is a set $M$ equipped with a single binary operation $\star\colon M\times M\to M$, closed under that operation and subject to no further axioms. No associativity, commutativity, identity element, or invertibility is assumed. An *equational law* over a magma is a universally quantified identity of the form $t_1(x,y,\ldots)=t_2(x,y,\ldots)$, where $t_1$ and $t_2$ are terms built from variables and $\star$. The law holds in a magma $(M,\star)$ if the identity is satisfied for every assignment of variables to elements of $M$.
### 2.2 Equational Implication
Given two equational laws $E_1$ and $E_2$, we say $E_1$ *implies* $E_2$ (written $E_1\Rightarrow E_2$) if every magma that satisfies $E_1$ also satisfies $E_2$. Deciding this relation is computationally asymmetric. A False instance — where $E_1\not\Rightarrow E_2$ — can be certified by exhibiting a single finite counterexample magma in which $E_1$ holds but $E_2$ fails; such certificates exist for all False cases among small magmas, making False decidable via finite search. A True instance requires establishing that no counterexample exists among all magmas, including infinite ones, for which no general algorithm is known. The implication problem over magmas is undecidable in general; finite model search is complete for False but not for True.
### 2.3 The Equational Theories Project
The Equational Theories Project [[12](https://arxiv.org/html/2604.18897#bib.bib14)] is a large-scale collaborative effort to map the implication structure of equational laws over magmas, formalized in Lean 4. The project has verified approximately 4,694 distinct equational laws and established the implication status of roughly 22 million equation pairs, producing the largest formally verified database of algebraic implications to date. This dataset provides both the mathematical foundation and the training signal for the SAIR competition benchmark.
### 2.4 Why the Task Is Hard for Language Models
The asymmetry between True and False creates a fundamental challenge for any fixed reasoning strategy. Producing a valid False verdict requires constructing or recalling a specific finite structure — a task amenable to lookup but not to general inference. Producing a valid True verdict requires establishing universal quantification over an infinite class of structures — a task that exceeds any finite enumeration and for which no sound and complete proof procedure exists in the general case. A language model operating without external symbolic tools must approximate both tasks within a single generation, using pattern recognition over its training distribution as a substitute for proof search. This approximation is the object of study in the present work.
## 3 Related Work
#### LLM mathematical reasoning benchmarks.
GSM8K [[3](https://arxiv.org/html/2604.18897#bib.bib6)] and MATH [[4](https://arxiv.org/html/2604.18897#bib.bib7)] established word-problem and competition-mathematics benchmarks that remain standard, but both assess reasoning over numeric domains with human-interpretable intermediate steps. Tasks requiring universal logical closure — deciding that a statement holds over *all* instances of a structure — are fundamentally different: a single counterexample refutes a universal claim, yet no finite check can confirm one. The SAIR Equational Theories benchmark evaluates 25 models across 200 problems under varying conditions. The official results show that even the strongest available model (Gemini 1.5 Pro) achieves 90.2% on hard problems without a cheatsheet, while weaker models cluster near chance, making the benchmark unusually diagnostic of ceiling effects.
#### Prompt engineering for formal reasoning.
Chain-of-thought prompting [[14](https://arxiv.org/html/2604.18897#bib.bib9)] demonstrated that instructing models to produce intermediate reasoning steps substantially improves performance on multi-step problems. Subsequent work established that few-shot exemplars [[1](https://arxiv.org/html/2604.18897#bib.bib8)] and zero-shot chain-of-thought instructions [[5](https://arxiv.org/html/2604.18897#bib.bib10)] are broadly effective. However, these findings are primarily established on numeric and commonsense domains. Less is known about how prompt complexity interacts with formal algebraic reasoning, where rule fidelity — not just step count — determines correctness.
#### Sycophancy and cognitive load in LLMs.
Sharma et al. [[10](https://arxiv.org/html/2604.18897#bib.bib12)] documented that RLHF-trained models consistently exhibit sycophancy across varied free-form generation tasks, adjusting their stated conclusions toward what they perceive the user wants even when incorrect. A related phenomenon — which we term *cognitive load collapse* — occurs when a prompt’s rule system is too complex for a model to follow reliably, causing it to default to surface heuristics. Shi et al. [[11](https://arxiv.org/html/2604.18897#bib.bib13)] showed that irrelevant context in math problems dramatically degrades accuracy, providing evidence that additional text can actively harm rather than help reasoning.
#### The Equational Theories Project.
The mathematical foundations of this competition derive from ongoing work on automated proof search over equational theories [[12](https://arxiv.org/html/2604.18897#bib.bib14)]. The project established that implications between equational laws over magmas exhibit complex dependency structures not tractable by brute-force enumeration, motivating the use of language models as approximate reasoners over this space.
#### In-context learning: length versus performance.
Liu et al. [[6](https://arxiv.org/html/2604.18897#bib.bib11)] showed that model attention is non-uniform over context: information at the beginning and end of a prompt is retrieved more reliably than information in the middle. This has direct implications for structured cheatsheets, and it is consistent with our finding that trivial-magma-first ordering (AN45c) outperforms counterexample-table-first ordering (AN38) by +7.5 percentage points at full scale ($n=400$; Section [5](https://arxiv.org/html/2604.18897#S5)). We hypothesize that placing the trivial-magma check first primes the model’s attention toward True verdicts before engaging the CE search, with the first substantive rule receiving disproportionate weight during generation; this mechanism remains to be verified through attention analysis.
## 4 Methodology
### 4.1 Task and Benchmark
The SAIR Equational Theories Stage 1 competition requires deciding, for a given pair of equations $(E_1,E_2)$ over magmas, whether $E_1\Rightarrow E_2$ holds universally (label: True) or whether a counterexample magma exists (label: False). The official evaluation judge is described in SAIR Foundation [[9](https://arxiv.org/html/2604.18897#bib.bib5)]. All experiments use the publicly available dataset SAIRfoundation/equational-theories-selected-problems (HuggingFace). We work with four labeled splits, summarized in Table [1](https://arxiv.org/html/2604.18897#S4.T1).
Table 1: Dataset splits used in this study. Hard3 serves as the primary evaluation split: its near-balanced distribution and size ($n=400$) most closely approximate the competition’s private evaluation set. Hard1 is used selectively to test False-detection strategies; hard2 provides a secondary balanced check. Normal problems confirm that interventions do not regress standard performance.
### 4.2 Models and Inference Configuration
We evaluate across three models matching the competition’s official multi-model setup:
gpt-oss-120b (primary): An open-weight 117B-parameter Mixture-of-Experts model released by OpenAI under the Apache 2.0 license [[7](https://arxiv.org/html/2604.18897#bib.bib4)], accessed via DeepInfra bf16 routing on OpenRouter (openai/gpt-oss-120b). This model uses an extended reasoning mode that produces chain-of-thought prior to its final verdict. All competition-credit evaluations use this model.
Llama 3.3 70B: Accessed via Together AI (meta-llama/Llama-3.3-70B-Instruct-Turbo). Standard instruction-tuned mode; no extended reasoning.
Gemma 4 31B: Accessed via Together AI (google/gemma-4-31b-it) with max_tokens=8192 to enable the model’s native reasoning trace. OpenRouter routing for this model suppresses reasoning mode, causing it to default to a near-constant True output (≈53%); all reported Gemma results use Together AI exclusively.
All inference runs use temperature=0 and seed=42. max_tokens is set to 4,096 for gpt-oss-120b and 8,192 for Gemma 4 31B. A preliminary experiment established that truncation at lower token budgets (512, 1,024, 2,048) produced 100% truncation-caused errors; 4,096 was the minimum budget at which genuine reasoning errors first appeared. Estimated cost per problem is $0.005–$0.01 depending on prompt length and model.
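For concreteness, the call pattern below shows how these settings map onto an OpenAI-compatible client. It is a minimal sketch under stated assumptions — the endpoint, placeholder key, and prompt handling are illustrative; the actual evaluation scripts are in the companion repository [2].

```python
from openai import OpenAI  # any OpenAI-compatible SDK; OpenRouter and Together AI expose this API

# Illustrative configuration only; real runs use the scripts in the companion repository.
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="<OPENROUTER_API_KEY>")

def verdict(prompt: str) -> str:
    """Single deterministic inference call with the settings reported above."""
    resp = client.chat.completions.create(
        model="openai/gpt-oss-120b",  # primary model; Gemma runs would use 8,192 max_tokens instead
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        seed=42,
        max_tokens=4096,
    )
    return resp.choices[0].message.content
```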
### 4.3 Evaluation Metrics
We report three primary metrics: Accuracy (fraction correct), True recall (fraction of True-labeled problems answered True), and False recall (fraction of False-labeled problems answered False). For multi-model comparisons we report the 3-model average: the unweighted mean of each model’s accuracy on the same split, aligned with the competition’s scoring rule.
Non-determinism is a practical concern at temperature=0: we observed up to ±3 percentage points of variance across identical runs (e.g., AN3d: 73.9% vs. 79.7% on separate executions). We report observed values without averaging across runs, and flag cases where $n\leq 20$ makes estimates unreliable (±10 pp at the 95% level).
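The metrics and the Wilson intervals quoted throughout can be reproduced with a few lines; the helper below is our own sketch, not the official judge.

```python
import math

def recalls(preds, labels):
    """Accuracy, True recall, and False recall over paired boolean prediction/label lists."""
    true_idx  = [i for i, y in enumerate(labels) if y]
    false_idx = [i for i, y in enumerate(labels) if not y]
    acc   = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    t_rec = sum(preds[i] for i in true_idx) / len(true_idx)
    f_rec = sum(not preds[i] for i in false_idx) / len(false_idx)
    return acc, t_rec, f_rec

def wilson_ci(k, n, z=1.96):
    """95% Wilson score interval for k successes out of n trials."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

print(wilson_ci(317, 400))  # AN45c at n=400: roughly (0.750, 0.829), matching the reported CI
```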
### 4.4 Prompt Design Space
Each submission is a single UTF-8 text file of at most 10 KB containing two placeholders, {{ equation1 }} and {{ equation2 }}. We designed and evaluated 45+ prompt variants over five weeks (AN-series: AN1 through AN45d).
Pipeline note. AN45c uses a self-contained template format with {{ equation1 }} and {{ equation2 }} placeholders inside the prompt body, requiring the --raw-prompt flag for correct substitution. An early run omitted this flag; the placeholders reached the model unsubstituted, producing an artifactual 56% result that was identified and discarded. All other variants in Table [2](https://arxiv.org/html/2604.18897#S5.T2) use build_prompt(), which embeds equations inline and is unaffected by this flag. The corrected AN45c results (April 14, 2026) are the only ones reported here.
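The distinction between the two substitution paths can be illustrated as follows; build_prompt() here is an assumed reconstruction of the pipeline helper named above, not its actual source.

```python
# Illustrative sketch of the two substitution paths described in the pipeline note.
def substitute_raw(template: str, eq1: str, eq2: str) -> str:
    """--raw-prompt path: the template carries its own {{ equation1 }} / {{ equation2 }} slots."""
    return (template
            .replace("{{ equation1 }}", eq1)
            .replace("{{ equation2 }}", eq2))

def build_prompt(cheatsheet: str, eq1: str, eq2: str) -> str:
    """Default path (assumed form): embed the equations inline after the cheatsheet body."""
    return f"{cheatsheet}\n\nEquation 1: {eq1}\nEquation 2: {eq2}\nDoes Equation 1 imply Equation 2?"

# Failure mode from the pipeline note: sending a self-contained template through the
# default path leaves the placeholders unsubstituted, which the model then sees verbatim.
```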
Variants differ along five design dimensions:
1. Counterexample (CE) table content: which small magmas are provided, from 4 size-2 structures (early variants) to 7 size-2 and 5 size-3 structures (AN38/AN45c).
2. Singleton-forcing rules: heuristics for detecting equations that force all elements equal (TRUE shortcuts). S1 is sound; S2 is unsound for right-zero magmas (Section [6](https://arxiv.org/html/2604.18897#S6)).
3. Block rules: conditional classifiers for structural patterns predictive of False. The most aggressive variant (Block 1 in AN3c) blanket-classified self-referential patterns as False, achieving high recall on hard1 but catastrophic precision loss on balanced splits.
4. Instruction ordering: whether the trivial-magma check or the CE table appears first. AN45c places the trivial-magma check first; AN38 places the CE table first.
5. Prompt length: 0 bytes (baseline) to 4,878 bytes (AN5, the longest variant tested).

Variants were selected for full-scale evaluation based on small-sample performance ($n=20$–50), with the most promising candidates validated at $n=200$–400. All prompt variants, evaluation scripts, and results are available at our companion repository [[2](https://arxiv.org/html/2604.18897#bib.bib2)].
## 5 Results
We present results along three axes: single-model performance across prompt variants (§[5.1](https://arxiv.org/html/2604.18897#S5.SS1)), cross-model generalization (§[5.2](https://arxiv.org/html/2604.18897#S5.SS2)), and cross-dataset stability (§[5.3](https://arxiv.org/html/2604.18897#S5.SS3)).
#### Sample size note.
Results for AN45c on gpt-oss-120b are based on $n=400$ hard3 problems (full-scale corrected pipeline). Cross-model runs (Llama 3.3 70B, Gemma 4 31B) and the DeepSeek V3.2 result use $n=50$, $n=20$, and $n=10$ respectively, as noted in Table [2](https://arxiv.org/html/2604.18897#S5.T2).
### 5.1 Single-Model Results (gpt-oss-120b, hard3)
Table [2](https://arxiv.org/html/2604.18897#S5.T2) reports accuracy, True recall, False recall, and prompt size for all variants evaluated on gpt-oss-120b on the hard3 split ($n=50$ unless noted), ordered by descending accuracy. The no-cheatsheet baseline (own run, April 14, 2026, $n=400$) achieves 59.75% overall (95% CI: [54.9%, 64.4%]), with a severe True bias: 82.6% True recall but only 38.0% False recall. This profile is consistent with the structural True bias documented in the official benchmark (89.2% True, 26.5% False on the hard split).
Table 2: Single-model results on hard3 (gpt-oss-120b, $n=50$ unless noted).

| Variant | Acc | T% | F% | Bytes | Strategy |
|---|---|---|---|---|---|
| AN45c ($n=400$)† | 79.3 | 95.9 | 63.4 | 2,252 | Trivial magma first + CE tables A–N |
| AN45c ($n=20$, OR)§ | 90.0–95.0 | — | — | 2,252 | Cross-provider validation (OpenRouter/DeepInfra) |
| AN38 ($n=400$)‡ | 71.8 | 78.5 | 65.4 | 1,776 | CE tables A–N (3-elem focus) |
| AN38 ($n=50$) | 74.0 | 70.8 | 76.9 | 1,776 | CE tables A–N (3-elem focus) |
| AN43 | 72.0 | 54.2 | 88.5 | 2,171 | Controller arch (BLOCKs as router) |
| AN35 | 72.0 | 58.3 | 84.6 | 1,545 | 3-elem CE focus |
| AN35b | 72.0 | 79.2 | 65.4 | 1,802 | Cautious TRUE + 3-elem CE |
| AN45d | 70.0 | 100.0 | 40.0 | 2,538 | AN45c + corrected STEP 1 flag |
| AN36 | 70.0 | 50.0 | 88.5 | 1,205 | Aggressive FALSE prior |
| AN39 | 70.0 | 91.7 | 50.0 | 385 | Power-level prior |
| Baseline ($n=400$) | 59.75 | 82.6 | 38.0 | 0 | No cheatsheet (own run, April 14, 2026) |
| AN3c ($n=50$) | 64.0 | 45.8 | 80.8 | 4,306 | BLOCK system |
| AN10 | 64.0 | — | — | 3,303 | Symbolic engine |
| AN19c | 62.0 | 91.7 | 34.6 | 289 | Trivial magma hint only |
| AN42 | 62.0 | — | — | — | KB rewriting |
| AN40 | 60.0 | — | — | — | Semantic invariants |
| AN41 | 58.0 | — | — | — | Tamari/tree-structure |
| AN5 | 54.0 | — | — | 4,878 | Maximum-length CE table |

† Corrected pipeline (--raw-prompt); $n=400$; primary result. ‡ Full-scale run; most reliable pre-fix estimate. § Cross-provider check only; $n=20$, ±10 pp CI.

Wilson 95% confidence intervals: AN45c $n=400$, 317/400 = 79.3%: [75.0%, 82.9%]; AN38 $n=400$, 287/400 = 71.8%: [67.1%, 75.9%]; AN3c hard1 $n=69$, 54/69 = 78.3%: [67.2%, 86.4%].

Performance is non-monotonic in prompt length: the longest variant (AN5, 4,878 bytes) is the worst-performing cheatsheet, while the shortest effective variant (AN39, 385 bytes) ties with mid-length prompts at 70%. True and False recall trade off sharply: no variant simultaneously achieves both above 80% on hard3. AN19c maximizes True recall (91.7%) at the cost of near-complete False recall collapse (34.6%), while AN36 inverts this profile (50.0% True, 88.5% False).
The AN45d result (100% True, 40% False) warrants attention: a minor modification to AN45c caused six False→True regressions, dropping overall accuracy from 90% to 70%. The ordering effect in AN45c depends on the exact token sequence, not on the semantic content of the rules.
#### Cheatsheet effect on model bias.
The no-cheatsheet baseline exhibits strong structural True bias: 82.6% True recall versus only 38.0% False recall ($n=400$, own run). AN45c does not merely raise overall accuracy — it rebalances this bias. False recall improves by +25.4 percentage points (38.0% → 63.4%), while True recall increases further to 95.9%. The cheatsheet therefore serves two distinct functions: it corrects the model’s structural tendency to classify hard problems as True, and it provides the finite counterexample evidence needed to produce confident False verdicts.
### 5.2 Cross-Model Results
Table [3](https://arxiv.org/html/2604.18897#S5.T3) reports results for key variants across all three competition models.
Table 3: Multi-model results on hard3 ($n=20$ for AN45c; $n=50$ otherwise).

| Variant | Bytes | gpt-oss | Llama | Gemma | 3-avg |
|---|---|---|---|---|---|
| AN45c – Scenario A (official)† | 2,252 | 95% | 55% | 53% (OR) | 67.7% |
| AN45c – Scenario B | 2,252 | 90% | 55% | 85% (TAI) | 76.7% |
| AN45c – Scenario C | 2,252 | 95% | 55% | 85% (TAI) | 78.3% |
| AN38 | 1,776 | 74% | 52% | 54% | 59.3% |
| AN19c | 289 | 62% | 60% | 55% | 59.0% |
| Baseline | 0 | 59.75% | 52% | ≈50% | — |
| AN3c | 4,306 | 64% | ≈0%‡ | — | — |

† Scenario A uses OR/Novita bf16 Gemma (reasoning suppressed, near-constant True output); conservative official lower bound. Scenarios B/C use Together AI (8,192 tokens, reasoning enabled) and differ only in the GPT result (90% vs. 95%, $n=20$). ‡ 0% True recall; overall accuracy reflects only correct False answers.

We report 67.7% (Scenario A) as the conservative official 3-model average. Scenarios B and C (76.7–78.3%) represent Gemma performance when correctly configured; we report them to distinguish configuration artifact from model capability.
The most striking cross-model finding is the AN3c collapse on Llama: the 4,306-byte Block system causes Llama to output False for every problem, yielding 0% True recall. AN19c (289 bytes) is the only variant that produces balanced recall on Llama (37.5% True, 80.8% False), and the only variant where Llama comes within noise of gpt-oss-120b (60% vs. 62%).
### 5.3 Cross-Dataset Generalization
Table [4](https://arxiv.org/html/2604.18897#S5.T4) shows performance across splits for AN3c and AN38 — variants optimized on hard1 and hard3 respectively.
Table 4: Cross-dataset performance for selected variants (gpt-oss-120b).

AN3c’s Block 1 rule was developed against hard1’s False-heavy distribution (35% True). On hard1 it achieves 78.3%, the highest single-split result in the study. On hard2 (balanced, 50% True), the same rule drops to 60.0% — only 8 percentage points above chance — because Block 1 misclassifies 51% of True problems as False. This 18.3 percentage-point degradation illustrates the distribution mismatch failure mode precisely. AN38 exhibits the complementary pathology: well-calibrated for hard3 (71.8% full-scale) but near chance on hard1 (50.7%). Both variants achieve ≈92% on normal problems, confirming that normal-split results are uninformative for distinguishing hard-problem strategies.
## 6 Analysis
### 6.1 Why AN45c Works: The Trivial Magma as an Exit Gate
AN45c’s primary mechanism is structural rather than informational. STEP 1 instructs the model to check whether $E_1$ contains a variable appearing exactly once on one side — a condition that forces all elements of any satisfying magma to be equal, collapsing it to the trivial one-element structure and making $E_2$ vacuously true. When triggered, STEP 1 bypasses STEP 2 (the counterexample search) entirely and commits to True before the CE tables introduce False-directional pressure.
On hard3, AN45c achieves 95.9% True recall compared to the baseline’s 82.6% — a gain of +13.3 percentage points — while achieving 63.4% False recall versus the baseline’s 38.0%, a gain of +25.4 percentage points. The overall improvement is +19.5 pp at $n=400$ (corrected pipeline). This gain appears to arise from reordering existing components rather than adding new content: AN38 contains the same STEP 1 logic but places it *after* the CE table, producing only 70.8% True recall and 76.9% False recall in the $n=50$ run (78.5% and 65.4% at full scale). We hypothesize that placing the trivial-magma check first primes the model’s attention toward True verdicts before the CE search introduces False-directional pressure — though isolating this ordering effect from confounds requires controlled experiments beyond the scope of this study.
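The flavor of the STEP 1 check can be illustrated with a sound special case. The snippet below is our own reconstruction, not the literal AN45c rule text (which is in the released prompts): it fires when one side of $E_1$ is a bare variable that never appears on the other side, a shape that does force the one-element magma.

```python
import re

def forces_trivial_magma(e1: str) -> bool:
    """One sound singleton-forcing pattern (illustrative; not the exact S1/STEP 1 wording):
    one side of E1 is a bare variable that never occurs on the other side. Any magma
    satisfying such a law has all elements equal, so E2 then holds vacuously."""
    lhs, rhs = (side.strip() for side in e1.split("="))
    for bare, other in ((lhs, rhs), (rhs, lhs)):
        if re.fullmatch(r"[a-z]", bare) and bare not in re.findall(r"[a-z]", other):
            return True
    return False

print(forces_trivial_magma("y = x * x"))     # True: every element equals x*x, so |M| = 1
print(forces_trivial_magma("x = x * y"))     # False: x reappears on the right, nothing is forced
```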
### 6.2 Why the Merge Ceiling Holds
The AN38 variant was constructed by merging two complementary predecessors: AN35 (True = 58.3%, False = 84.6%) and AN35b (True = 79.2%, False = 65.4%). The result was AN38: True = 70.8%, False = 76.9% — a near-arithmetic mean of the two parents, not a combination of their strengths. Three subsequent merge attempts (AN45d, AN45e, AN45f) all produced accuracy at or below the mean. A static instruction set cannot simultaneously prime two competing inference strategies: when the model encounters rules favoring both labels, it averages rather than selects.
### 6.3 The Token Cap Finding
Gemma 4 31B exhibits a distinctive failure mode when max_tokens falls below the budget required to complete CE verification. At 2,048 tokens, Gemma with AN45c produces 50% accuracy with 0% False recall — effectively outputting True for every problem. At 8,192 tokens, the same model and prompt produces 85% accuracy (True = 90%, False = 80%). The mechanism: Gemma exhausts its token budget during STEP 2, fails to produce a Counterexample block, and defaults to the STEP 1 exit gate verdict (True). This is not a model capability failure; it is a configuration artifact. Reported Gemma results are only valid when the token budget is sufficient to complete the full reasoning trace.
### 6.4 Why Theoretical Approaches Fail
Four variants introducing explicit mathematical reasoning frameworks all degraded performance relative to the 68% baseline ($n=50$ subset, consistent with Table [5](https://arxiv.org/html/2604.18897#S6.T5) deltas):
Table 5: Theory-based variants on hard3 (gpt-oss-120b, $n=50$).

The consistent direction of failure is a False→True error increase. When given a structured reasoning framework, the model generates plausible arguments satisfying the framework’s criteria and commits to True, even for problems where a small counterexample exists. False precision in the instructed framework is worse than no framework at all.
### 6.5 A Note on Numerical Coincidence
AN45c’s conservative 3-model average (Scenario A: 67.7%) and AN3c’s hard1 single-model accuracy (78.3%) are occasionally cited together in competition discussions. They are not comparable: the former is a balanced hard3 average across three models under a specific provider configuration; the latter is a single-model result on a False-heavy split where a blanket-False strategy would already score 65%. AN3c on a balanced split achieves 60–64%.
#### Structural classification hypothesis.
Our results suggest that large language models in this task behave not as symbolic theorem provers but as heuristic classifiers over algebraic structure patterns. The prompt encodes a decision boundary over structural features — variable repetition, nesting depth, singleton forcing — rather than a proof system. This framing explains both the effectiveness of counterexample tables (which encode explicit structural decision rules) and the failure of procedural approaches (which require genuine symbolic execution). We term this the *router hypothesis*: the model routes problems to cached structural patterns rather than deriving answers compositionally. This has a direct practical implication: prompt improvements that add new structural rules can help, but only up to the capacity of the model to maintain and apply those rules consistently — which is precisely the saturation boundary we observe.
## 7 Multi-Model Generalization
### 7.1 AN19c as the Model-Agnostic Minimum
AN19c (289 bytes) consists of three natural-language hints with no counterexample tables: a reminder that one-element magmas trivially satisfy all equations, a note that equations with a free variable isolated on one side are likely True, and an instruction to output a verdict with brief reasoning. On gpt-oss-120b it achieves 62% on hard3 — below most CE-table variants — but it is the only prompt that produces meaningful True recall on Llama 3.3 70B (37.5% True, 80.8% False, 60% overall) and competitive performance on Gemma 4 31B (55%, Together AI).
Table [6](https://arxiv.org/html/2604.18897#S7.T6) summarizes multi-model performance for variants with coverage across three or more models.
Table 6: Multi-model generalization on hard3 ($n=10$–50; see notes).
### 7.2 Model-Specific Bias Profiles
gpt-oss-120b is True-biased at baseline: 82.6% True recall versus 38.0% False recall on hard3. CE table prompts partially correct this but never eliminate the bias — even AN38’s best full-scale result (71.8%) reduces the baseline’s 44.6-point recall gap (82.6% True vs. 38.0% False) to just 13.1 points (78.5% True vs. 65.4% False) — a substantial improvement, but the imbalance persists.
Llama 3.3 70B exhibits the opposite profile: near-baseline False recall (80.8% with AN19c) but severely depressed True recall (37.5% with the best prompt; effectively 0% with any prompt exceeding ≈2 KB). Longer prompts collapse True recall further, suggesting a capacity ceiling on instruction-following rather than a knowledge deficit.
Gemma 4 31B is token-gated rather than directionally biased: its failure mode is token budget exhaustion, which triggers the STEP 1 exit gate and produces near-constant True output. With sufficient budget (8,192 tokens), Gemma’s recall profile is balanced and strong (85% with AN45c on $n=20$).
### 7.3 DeepSeek V3.2: Cost Efficiency Signal
DeepSeek V3.2 with AN19c achieves 80% on hard3 ($n=10$) at ≈$0.0008 per problem — roughly one order of magnitude cheaper than gpt-oss-120b at comparable accuracy. With AN38 (1,776 bytes), DeepSeek drops to 60%, confirming the pattern observed in Llama: CE table prompts hurt capable reasoners by introducing distracting structure that interferes with internal algebraic reasoning. The AN19c result on DeepSeek is the most cost-efficient signal in the study ($n=10$; highest-priority target for scale validation).
## 8 The Single-Prompt Ceiling
### 8.1 Formal Characterization
We define the *empirical saturation region* of single-prompt engineering as the accuracy range where further prompt iterations produce unstable, non-generalizable improvements on a given model, holding inference parameters fixed. Empirically, the saturation region lies at approximately 71–79% for gpt-oss-120b on hard3 at scale (AN38: 71.8% at $n=400$; AN45c: 79.3%, 95% CI: [75.0%, 82.9%], at $n=400$ with the corrected pipeline). Despite more than 45 variants tested over five weeks, no full-scale evaluation exceeded this range on a balanced split.
This saturation is not merely a measurement artifact. It manifests as a structural pattern: prompt elements that increase True recall tend to decrease False recall by approximately the same margin, and vice versa.
The best-performing prompts do not form a dispersed cloud in (True recall, False recall) space — they approximate a Pareto front. Three variants define its boundary: AN19c (True = 91.7%, False = 34.6%), AN45c (True = 90.0%, False = 80.0%), and AN36 (True = 50.0%, False = 88.5%). AN45c dominates most other variants on both dimensions simultaneously — including AN38 and AN35b — but does not dominate AN36 on False recall (80.0% vs. 88.5%) or AN19c on True recall (90.0% vs. 91.7%). The saturation region therefore describes an empirical Pareto-optimal boundary within the single-prompt paradigm — not necessarily an absolute theoretical limit, but a practical constraint that our 45+ variant search did not overcome.
### 8.2 The Merge Pattern: $\mathrm{avg}(A,B)$, Not $\max(A,B)$
AN35 and AN35b are complementary specialists: AN35 achieves True = 58.3%, False = 84.6%; AN35b achieves True = 79.2%, False = 65.4%. Their arithmetic means are True = 68.8%, False = 75.0%. The merge result AN38 produces True = 70.8%, False = 76.9% — within 2 pp of the arithmetic mean on both dimensions, not near the respective maxima. Three subsequent attempts (AN45d, AN45e, AN45f) replicated the averaging result. Combining complementary prompts in a static text file yields average performance, not maximum performance.
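The averaging claim is a one-line check:

```python
an35, an35b, an38 = (58.3, 84.6), (79.2, 65.4), (70.8, 76.9)       # (True recall, False recall)
parent_mean = tuple(round((a + b) / 2, 1) for a, b in zip(an35, an35b))
print(parent_mean)                                                  # (68.8, 75.0)
print(tuple(round(x - m, 1) for x, m in zip(an38, parent_mean)))    # (2.0, 1.9): at the mean, not the max
```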
### 8.3 Why Routing Cannot Be Encoded in a Static Prompt
True and False hard problems require qualitatively different inference strategies. True problems are best handled by identifying structural properties forcing singleton magmas or by exhausting small counterexample candidates. False problems are best handled by finding a single finite counterexample quickly. A prompt that heavily weights the CE search primes False detection and suppresses True; a prompt that heavily weights the trivial-magma check does the opposite.
Solving this tension requires conditional routing: apply the True strategy when the equation has a specific syntactic form, apply the False strategy otherwise. A static prompt cannot implement this routing reliably because LLMs execute instructions probabilistically rather than conditionally — they interpolate between strategies in proportion to their textual weight rather than selecting the appropriate one per instance.
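As an illustration of what such routing would look like outside the prompt, the sketch below is hypothetical (no such ensemble was run in this study): a deterministic router chooses between two specialist cheatsheets from a syntactic feature of $E_1$, so each prompt only ever carries one strategy.

```python
# Hypothetical two-prompt ensemble with an external syntactic router.
# The specialist strings stand in for prompts such as AN19c and AN36.
TRUE_SPECIALIST = "<cheatsheet tuned for True recall: trivial-magma hints>"
FALSE_SPECIALIST = "<cheatsheet tuned for False recall: counterexample tables>"

def is_singleton_forcing(e1: str) -> bool:
    """Stub for the sound syntactic check sketched in Section 6.1."""
    lhs, rhs = (s.strip() for s in e1.split("="))
    return (len(lhs) == 1 and lhs not in rhs) or (len(rhs) == 1 and rhs not in lhs)

def route(e1: str) -> str:
    """Pick the specialist prompt from syntactic features of E1 alone."""
    return TRUE_SPECIALIST if is_singleton_forcing(e1) else FALSE_SPECIALIST

# Each selected prompt is then filled and sent to the model exactly as in Section 4.2;
# the router, not the prompt text, carries the conditional logic that an LLM executes unreliably.
print(route("y = x * x"), route("x * y = y * x"), sep="\n")
```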
### 8.4 Theoretical Upper Bound and the Path Beyond
gpt-oss-120b with no cheatsheet achieves 91.7% True recall and 92.3% False recall on *normal* problems. If a perfect external router could direct each hard problem to the appropriate strategy, the theoretical ceiling would be approximately
$$0.5\times 91.7\% + 0.5\times 92.3\% = 92.0\%,$$
well above the observed 71–79% saturation region. Gemini 1.5 Pro achieves 90.2% on hard problems with no cheatsheet, suggesting that frontier models have internalized routing-equivalent algebraic reasoning that cannot be injected through prompting into weaker models.
Concretely, escaping the empirical saturation region for gpt-oss-120b on this task likely requires one of: (a) an ensemble of specialized prompts with external routing by problem type; (b) fine-tuning on the Equational Theories Project graph (≈22 million labeled equation pairs); or (c) a hybrid LLM + symbolic verifier architecture where the LLM proposes candidate counterexamples and a Mace4-equivalent tool verifies them.
## 9 Official Benchmark Validation and Cross-Distribution Analysis
### 9.1 Data and Scope
All results in this section are drawn from the SAIR Contributor Network public leaderboard as of April 20, 2026 ($n=52$ voluntary public submissions of 1,007 total registered participants, cited under SAIR’s open science framework [[8](https://arxiv.org/html/2604.18897#bib.bib1)]). The full competition leaderboard is scheduled for release on or before April 30, 2026. Official benchmark scores use the SAIR evaluation pipeline: OpenRouter/DeepInfra bf16, temperature 0.0, seed 0, max tokens 8,192 [[9](https://arxiv.org/html/2604.18897#bib.bib5)]. Local scores use Together AI bf16, temperature 0.0, seed 42.
### 9.2 Official Benchmark Results
Table [7](https://arxiv.org/html/2604.18897#S9.T7) compares local evaluation results against the SAIR official benchmark on hard3 (the competition’s reference format, as confirmed by the official smoke-test file problems_hard3_20.jsonl [[9](https://arxiv.org/html/2604.18897#bib.bib5)]).
Table 7: Local vs. official benchmark results on hard3 (GPT-OSS 120B). Official benchmark: $n=20$, DeepInfra bf16. Local: $n=400$, Together AI bf16. Baseline: no cheatsheet.

AN38 produces a robust +5.6 pp improvement over the official baseline on hard3, consistent with its local result (71.8%). AN45c, despite achieving 79.25% locally ($n=400$), scores 55.5% on the official benchmark — 4.3 pp *below* the no-cheatsheet baseline. The trivial-magma exit gate (STEP 1), which was AN45c’s primary structural innovation over AN38, appears to generate forced errors on the official problem sample: problems that satisfy the singleton test are committed to True before counterexample evidence is considered. This is the same distribution-mismatch failure mode documented in Section [5](https://arxiv.org/html/2604.18897#S5) for AN3c on hard1/hard2.
Table 8: Official benchmark results on hard2 (GPT-OSS 120B, $n=20$). Hard2 baseline: 56.3% accuracy, F1 ≈ 64.9%.

On hard2, both variants fall below the no-cheatsheet baseline. AN38 incurs a larger accuracy penalty (−15.3 pp) and F1 degradation (−17.1 pp), while AN45c is less damaging (−8.3 pp accuracy, −3.7 pp F1). Neither cheatsheet improves over baseline on hard2, confirming that both are calibrated toward the hard3 distribution.
### 9.3 Cross-Distribution Trade-Off Surface
Of the 52 Contributor Network submissions, 13 had benchmark results available on both hard2 and hard3 as of competition close (April 20, 2026); Table [9](https://arxiv.org/html/2604.18897#S9.T9) presents these 13 submissions.
Table 9: Cross-distribution performance: hard2 vs. hard3 (GPT-OSS 120B, SAIR official benchmark). Contributor Network data, April 20, 2026 [[8](https://arxiv.org/html/2604.18897#bib.bib1)]. Participants cited by name under SAIR’s open science framework.

The table reveals a consistent pattern: among the 52 Contributor Network submissions with benchmark results, only one — Arjun Garg’s bank_lookup_v8 (83.0% hard2, 70.8% hard3) — achieves above 65% accuracy on both distributions simultaneously. All other submissions optimizing for one distribution collapse on the other, with accuracy gaps ranging from 11 to 62.7 percentage points (Table [9](https://arxiv.org/html/2604.18897#S9.T9)). The highest-accuracy submissions on each split are explicitly or implicitly distribution-specific: Betka’s 99.0% on hard2 collapses to 56.3% on hard3 (−42.7 pp); Woon Siang Yi’s 81.3% on hard3 collapses to 51.0% on hard2 (−30.3 pp), with the participant themselves labeling the submission hard3_overfitted. We term this the *cross-distribution trade-off surface*: within the empirical saturation region, optimizing for one distribution trades against performance on complementary distributions. AN45c exhibits one of the smallest cross-split accuracy gaps (7.5 pp) among submissions scoring above the no-cheatsheet baseline on at least one split, suggesting that distribution robustness and peak accuracy are competing objectives within the saturation zone.
### 9.4 Implications for the Saturation Region
These results reframe the saturation region finding. The ceiling is not merely an upper bound on accuracy — it is a *fragility zone* where gains are distribution-dependent and structural innovations introduce failure modes on unseen distributions. AN45c’s local result (79.25%, $n=400$) is valid under its measurement conditions; the 23.75 pp gap to the official benchmark is not measurement error but distribution mismatch, the same phenomenon documented internally in Section [5](https://arxiv.org/html/2604.18897#S5).
For practitioners, this implies: a cheatsheet that substantially outperforms baseline on a held-out set does not guarantee generalization. Cross-distribution validation — running the same variant on two structurally different problem samples — is necessary to distinguish genuine ceiling improvements from distribution-specific optimization.
A structurally distinct approach observed in the Contributor Network further supports the router hypothesis: one participant (Heath, distilled-rules-12) submitted a pure structural classifier computing five syntactic features of $E_1$ (variable counts, node depths, LHS structure) and applying a hand-coded decision tree with no mathematical content. This approach achieves 80.0% on hard2 and 59.0% on hard3 — competitive with CE-table approaches — suggesting that structural pattern classification alone, without any algebraic reasoning, captures a substantial fraction of the signal available to prompt-based methods. This constitutes independent empirical evidence for the router hypothesis: the task rewards structural pattern matching over symbolic reasoning.
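For illustration, a classifier of this kind might look like the sketch below; the features and thresholds are hypothetical stand-ins, not the actual distilled-rules-12 submission.

```python
# Illustrative reconstruction of a purely structural classifier in the spirit of the
# approach described above (features and thresholds are invented for the example).
def features(e1: str) -> dict:
    lhs, rhs = (s.strip() for s in e1.split("="))
    variables = [c for c in e1 if c.isalpha()]
    return {
        "n_vars": len(set(variables)),        # distinct variables in E1
        "n_occurrences": len(variables),      # total variable occurrences
        "lhs_depth": lhs.count("("),          # crude nesting depth of the LHS
        "rhs_depth": rhs.count("("),          # crude nesting depth of the RHS
        "lhs_is_bare_var": len(lhs) == 1,     # LHS is a single variable
    }

def predict(e1: str) -> bool:
    """Hand-coded decision tree over the features above (hypothetical thresholds)."""
    f = features(e1)
    if f["lhs_is_bare_var"] and f["n_vars"] >= 2:
        return True                 # singleton-like shapes tend to be True
    if f["rhs_depth"] - f["lhs_depth"] >= 2:
        return False                # deeply nested right sides tend to admit counterexamples
    return f["n_occurrences"] <= 4  # short, flat laws default to True
```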
Notably, one participant (McKenna, unpublished) reported achieving approximately 70% accuracy across all three evaluation models simultaneously using a single theory-grounded cheatsheet — the only submission observed to maintain consistent performance across gpt-oss-120b, Llama 3.3 70B, and Gemma 4 31B. This convergence across models with substantially different capacity profiles suggests that mathematically grounded approaches may generalize more robustly than empirically iterated CE-table approaches, a hypothesis warranting systematic investigation in future work.
## 10 Conclusion
#### Summary of contributions.
We present the first systematic empirical characterization of the single-prompt ceiling in formal reasoning tasks. Across 45+ prompt variants, balanced hard accuracy for gpt-oss-120b on hard3 plateaus at ≈71–79% at scale. Merge experiments confirm that combining complementary prompts produces accuracy near the arithmetic mean of the parents, not near their maxima — a structural constraint, not a local optimum. Our highest-performing local submission (AN45c, 2,252 bytes) achieves 79.25% on gpt-oss-120b ($n=400$; 95% CI: [75.0%, 82.9%]), with True recall of 95.9% and False recall of 63.4%. AN38 is our most robust submission under distribution shift, producing +5.6 pp over the official baseline (Section [9](https://arxiv.org/html/2604.18897#S9)). A cross-provider validation run on OpenRouter/DeepInfra bf16 ($n=20$) yielded 90–95%, consistent with the full-scale local result. The token cap finding demonstrates that Gemma’s performance (50% vs. 85%) is entirely determined by provider configuration, not model capability. Model-specific bias profiles — True bias in gpt-oss-120b, instruction-capacity collapse in Llama, token-gated reasoning in Gemma — are stable across variants and not correctable by prompt engineering alone.
Post-submission analysis against the SAIR official benchmark (Section [9](https://arxiv.org/html/2604.18897#S9)) reveals that AN45c (79.25% local, $n=400$) scores 55.5% on the official hard3 benchmark — 4.3 pp below the no-cheatsheet baseline — while AN38 produces a robust +5.6 pp improvement (65.3%) with the highest balanced F1 score (67.6%) among non-overfit submissions in the Contributor Network ($n=52$ visible submissions, April 20, 2026). The cross-distribution trade-off surface (Table [9](https://arxiv.org/html/2604.18897#S9.T9)) shows that among the 52 Contributor Network submissions, only Arjun Garg’s bank_lookup_v8 achieves above 65% on both hard2 and hard3 simultaneously (83.0% and 70.8%), confirming that the saturation region is a fragility zone where distribution-specific gains trade against generalization.
#### Limitations.
The primary AN45c result (79.25%, $n=400$) is statistically reliable (95% CI: [75.0%, 82.9%]); however, its True recall (95.9%) and False recall (63.4%) reflect an imbalanced profile that may not generalize across providers or problem distributions. Small-sample limitations remain for Gemma (85%, $n=20$) and DeepSeek (80%, $n=10$), both carrying ±10–15 pp confidence intervals. The cross-provider OpenRouter/DeepInfra run ($n=20$, 90–95%) should be read as a consistency check, not an independent measurement. The Gemma official provider (OpenRouter/Novita bf16) suppresses reasoning mode; we were unable to test this configuration in a controlled way. All results are specific to equational implication over magmas; generalization to other formal reasoning domains is an open question.
The primary AN45c result (79.25%, $n=400$, Together AI bf16) does not generalize to the SAIR official benchmark (55.5%, $n=20$, DeepInfra bf16), a gap of −23.75 pp. AN38’s smaller gap (−6.5 pp local to official) suggests that simpler prompts generalize more robustly within the saturation region. The Contributor Network data ($n=52$ of 1,007 participants) is a voluntary public sample; the full competition leaderboard (scheduled for April 30, 2026) may reveal additional patterns not visible in the current data.
#### Future work.
Three directions follow from the ceiling characterization: (1) an ensemble of two specialized prompts with external routing by syntactic problem features; (2) fine-tuning on the Equational Theories Project implication graph (≈22M labeled pairs); and (3) replication on other computationally asymmetric formal reasoning tasks (satisfiability checking, reachability in formal systems).
#### Closing.
The central lesson is structural: what improved performance was not teaching the model new mathematics but controlling the order in which the model applies the mathematics it already knows. Expanding the model’s reasoning repertoire through longer, more elaborate prompts consistently underperformed constraining its reasoning flow through minimal, well-ordered instructions. In formal reasoning as in engineering: less, structured deliberately, beats more.
## References
- [1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020). Language models are few-shot learners. In Advances in Neural Information Processing Systems, Vol. 33, pp. 1877–1901.
- [2] SAIR prompt engineering — equational theories stage 1 (2026). https://github.com/israelcazares/sair-prompt-engineering. Accessed April 2026.
- [3] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman (2021). Training verifiers to solve math word problems. arXiv:2110.14168.
- [4] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021). Measuring mathematical problem solving with the MATH dataset. In Advances in Neural Information Processing Systems, Vol. 34, pp. 4130–4143.
- [5] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa (2022). Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, Vol. 35, pp. 22199–22213.
- [6] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang (2023). Lost in the middle: how language models use long contexts. Transactions of the Association for Computational Linguistics 12, pp. 157–173.
- [7] OpenAI (2025). gpt-oss-120b & gpt-oss-20b model card. arXiv:2508.10925.
- [8] SAIR Foundation (2026). SAIR mathematics distillation challenge — equational theories: contributor network leaderboard. https://competition.sair.foundation/competitions/mathematics-distillation-challenge-equational-theories-stage1/leaderboard. Data as of April 20, 2026 (competition close); n=52 voluntary public submissions of 1,007 total registered participants. Full competition leaderboard scheduled for release on or before April 30, 2026. Cited under SAIR open science framework (Official Rules, Section 5).
- [9] SAIR Foundation (2026). Stage 1 judge for the mathematics distillation challenge: equational theories. https://github.com/SAIRcompetition/equational-theories-stage1-judge. Official evaluation setup: OpenRouter/DeepInfra bf16, temperature 0.0, seed 0, max tokens 8,192. Canonical smoke test: problems_hard3_20.jsonl. Accessed April 2026.
- [10] M. Sharma, M. Tong, T. Korbak, D. Duvenaud, A. Askell, S. R. Bowman, N. Cheng, E. Durmus, Z. Hatfield-Dodds, S. R. Johnston, S. Kravec, T. Maxwell, S. McCandlish, K. Ndousse, O. Rausch, N. Schiefer, D. Yan, M. Zhang, and E. Perez (2024). Towards understanding sycophancy in language models. In International Conference on Learning Representations. arXiv:2310.13548.
- [11] F. Shi, X. Chen, K. Misra, N. Scales, D. Dohan, E. H. Chi, N. Schärli, and D. Zhou (2023). Large language models can be easily distracted by irrelevant context. In Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 202, pp. 31210–31227.
- [12] T. Tao et al. (2024). Equational theories project. https://github.com/teorth/equational_theories. Accessed April 2026.
- [13] T. Tao (2026). Mathematics distillation challenge – equational theories. https://terrytao.wordpress.com/2026/03/13/mathematics-distillation-challenge-equational-theories/. Accessed April 2026.
- [14] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, and D. Zhou (2022). Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, Vol. 35, pp. 24824–24837.