KACE: Knowledge-Adaptive Context Engineering for Mathematical Reasoning

arXiv cs.AI Papers

Summary

KACE introduces a knowledge-adaptive context engineering method that separates storage from usage via an epistemic tree and tiered self-consistency, achieving 62.2% on AIME 2025—a 10.4-point gain over fixed self-consistency.

arXiv:2606.00532v1 Announce Type: new Abstract: Context engineering can improve large language models without updating their weights, but mathematical reasoning exposes a key limitation: feedback accumulated in one growing prompt causes context bloat and limits the amount of learned guidance that can be used. Existing methods often conflate storage, what is learned across runs, with usage, what is included for a particular problem, and therefore inherit this prompt-size ceiling. We introduce Knowledge-Adaptive Context Engineering (KACE), which separates storage from usage through difficulty- and domain-based organization. Offline, a self-reflective learning loop distills training traces into an epistemic tree: a knowledge base of typed cards stratified by problem difficulty and epistemic domain. Each card is assigned to the difficulty-domain node corresponding to the failure from which it originated. At evaluation time, tiered self-consistency with per-tier agreement gates dynamically classifies each problem as easy, medium, or hard. Easy problems exit without retrieved cards, while harder problems retrieve only the matching branch of the tree. This tiered scheme matches or exceeds Best-of-N while using comparable compute, and it classifies problem difficulty with 78 percent pairwise concordance. The main empirical contribution is the construction and use of a difficulty- and domain-stratified knowledge base enabled by tiered self-consistency. On AIME 2025, KACE achieves 62.2 percent accuracy, a 10.4-point absolute gain over fixed Best-of-5 self-consistency at a comparable solver-call budget and a 5.6-point gain over the strongest learned-context baseline, Tiered + GEPA. We also observe consistent gains on MATH-HARD and the verifiable subset of OlymMATH.
Original Article
View Cached Full Text

Cached at: 06/02/26, 03:47 PM

# KACE: Knowledge-Adaptive Context Engineering for Mathematical Reasoning
Source: [https://arxiv.org/html/2606.00532](https://arxiv.org/html/2606.00532)
Jayant Parashar School of Computing University of Georgia Jayant\.Parashar@uga\.edu &Suchendra M\. Bhandarkar School of Computing University of Georgia suchi@uga\.edu

###### Abstract

Context engineering has become an effective way to improve large language models on a range of tasks without updating their weights\. However, for mathematical reasoning, accumulating feedback monolithically in a single growing prompt leads to 1\) context bloat: irrelevant guidance begins to crowd out useful information, and 2\) an effective knowledge\-store ceiling imposed by the active prompt size\. Existing context\-engineering methods often conflate storage—what is learned across runs—with usage—what is included in the prompt for a particular problem—and therefore inherit this bottleneck\. We introduce Knowledge\-Adaptive Context Engineering \(KACE\), which separates storage from usage through an epistemic and difficulty\-based demarcation\. Offline, a self\-reflective learning loop distills training traces into an epistemic tree: a knowledge base of typed cards stratified by problem difficulty and epistemic domain\. Each card is assigned to the difficulty\-domain node corresponding to the failure from which it originated\. At evaluation time, a tiered self\-consistency procedure with per\-tier agreement gates dynamically classifies each problem as easy, medium, or hard: easy problems exit card\-free, while escalated problems retrieve only the matching branch of the tree\. By itself, this tiered scheme matches or exceeds Best\-of\-NNwhile using comparable compute, and it classifies problem difficulty with 78% pairwise concordance\. The central empirical contribution of this paper is the construction and use of a difficulty\- and domain\-stratified knowledge base, enabled by tiered self\-consistency\. On AIME 2025,KACEachieves 62\.2% accuracy: a10\.410\.4\-point absolute gain over fixed Best\-of\-55self\-consistency at a comparable solver\-call budget, and a5\.65\.6\-point gain over the strongest learned\-context baseline, Tiered\+\+GEPA\. We further observe consistent gains on MATH\-HARD and the verifiable subset of OlymMATH\.

## 1Introduction

Context engineering improves frozen language models by learning the information placed around a problem rather than updating model weights\. For mathematical reasoning, however, learned context creates a practical bottleneck: if every reflected lesson is accumulated into one prompt, useful guidance eventually competes with irrelevant guidance, and the reusable knowledge store remains bounded by the size of the active context\. Existing methods such as ACE\-style evolving playbooks\(Zhanget al\.,[2025b](https://arxiv.org/html/2606.00532#bib.bib1)\)and GEPA\-style reflective prompt artifacts\(Agrawalet al\.,[2025](https://arxiv.org/html/2606.00532#bib.bib15)\)demonstrate the value of learning from feedback, but they often keep storage and usage tightly coupled: what is learned is also what the solver sees\.

A separate line of work shows that reasoning compute should be allocated adaptively rather than uniformly\. Self\-consistency improves mathematical reasoning by sampling multiple solution paths and selecting by agreement\(Wanget al\.,[2022](https://arxiv.org/html/2606.00532#bib.bib6)\), while adaptive and cascade\-style methods vary inference effort according to problem difficulty\(Aggarwalet al\.,[2023](https://arxiv.org/html/2606.00532#bib.bib50); Snellet al\.,[2024](https://arxiv.org/html/2606.00532#bib.bib39); Kimet al\.,[2024](https://arxiv.org/html/2606.00532#bib.bib46); Changet al\.,[2026](https://arxiv.org/html/2606.00532#bib.bib47)\)\. We build on this idea with a locked tiered self\-consistency procedure\. Before adding any learned knowledge, this tiered procedure already matches or exceeds fixed Best\-of\-55self\-consistency while using comparable solver\-call budgets: on AIME 2025, MATH\-HARD, and OlymMATH\-EN it obtains52\.2%52\.2\\%,73\.3%73\.3\\%, and44\.4%44\.4\\%accuracy, compared with51\.8%51\.8\\%,72\.7%72\.7\\%, and40\.7%40\.7\\%for Best\-of\-55\(Table[3](https://arxiv.org/html/2606.00532#S3.T3)\)\. The same exit path also acts as an empirical difficulty signal: its tier ordering preserves the model\-derived difficulty ordering of78%78\\%of AIME 2025 problem pairs and80%80\\%of MATH\-HARD pairs, well above the50%50\\%random baseline \(Appendix[G](https://arxiv.org/html/2606.00532#A7)\)\.

We introduceKnowledge\-Adaptive Context Engineering\(KACE\), a framework that uses this tier signal to separate learned\-context storage from learned\-context usage\. Offline,KACEdistills training traces into an*epistemic tree*: a knowledge base of typed cards stratified by problem difficulty and epistemic domain\. Each card records a reusable reasoning aid, such as a lemma, invariant, decomposition pattern, or verification rule, and is placed at the difficulty\-domain node corresponding to the failure from which it was learned\. At evaluation time, tiered self\-consistency classifies each problem as easy, medium, or hard through per\-tier agreement gates: easy problems exit without cards, and escalated problems retrieve only the matching branch of the tree\. Thus the durable knowledge base can grow without forcing every problem to read the whole store\.

This design is motivated by a failure mode we observe in hard mathematical reasoning\. Monolithic context accumulation can degrade as irrelevant lessons crowd out useful ones; domain\-only partitioning mitigates this temporarily but does not solve the difficulty mismatch; compact prompt optimization avoids some bloat but limits the amount of reusable knowledge that can be stored\.KACEaddresses the shared assumption behind these failures: that one active context should carry the full burden of learned knowledge\. Instead, it treats learned context as a structured resource that is conditionally read according to the problem’s empirical difficulty and domain\.

#### Contributions\.

Our contributions are threefold\.\(1\) A difficulty\- and domain\-stratified epistemic tree\.KACEstores reusable mathematical reasoning knowledge outside the active prompt as typed cards placed under difficulty\-domain nodes\.\(2\) Locked tiered self\-consistency as a difficulty classifier\.We introduce tiered self\-consistency as both a compute\-equivalent alternative to fixed self\-consistency \(matching or exceeding Best\-of\-55on three benchmarks\) and a calibrated empirical\-difficulty signal: its exit ordering preserves the model’s own solve\-rate ordering on78%78\\%of AIME 2025 and80%80\\%of MATH\-HARD problem pairs, supplying the difficulty coordinate read byθR\\theta\_\{R\}without training a separate classifier\.\(3\) Empirical evidence on hard mathematical reasoning\.On AIME 2025,KACEachieves62\.2%62\.2\\%accuracy, a10\.410\.4\-point absolute gain over fixed Best\-of\-55self\-consistency at a comparable solver\-call budget and a5\.65\.6\-point gain over the strongest learned\-context baseline, Tiered\+\+GEPA\. We also observe consistent gains on MATH\-HARD and the verifiable subset of OlymMATH, with ablations isolating the value of the difficulty and domain axes\.

## 2Methodology

### 2\.1Problem Setup: Epistemic\-Difficulty Context Function

KACEwraps a frozen solver with a structured external context\. The learned object is an*epistemic tree*𝒦\\mathcal\{K\}: a persistent store of typed cards indexed by problem difficulty and epistemic domain\. Let𝒯=\{ES,MS,HS\}\\mathcal\{T\}=\\\{\\mathrm\{ES\},\\mathrm\{MS\},\\mathrm\{HS\}\\\}denote the three difficulty tiers and let𝒟\\mathcal\{D\}denote the set of domains\. The tree is a block\-structured object

𝒦=\{𝒦b\}b∈ℬ,ℬ=𝒯×𝒟,\\mathcal\{K\}=\\\{\\mathcal\{K\}\_\{b\}\\\}\_\{b\\in\\mathcal\{B\}\},\\qquad\\mathcal\{B\}=\\mathcal\{T\}\\times\\mathcal\{D\},\(1\)where each blockb=\(t,d\)b=\(t,d\)contains the cards visible at tierttand domaindd; universal cards are stored in𝒦univ\\mathcal\{K\}\_\{\\mathrm\{univ\}\}and are visible at the learned\-card tiers\.

For a problemxx, the active context is

C​\(x;θR,𝒦\)=Concat​\(x,ρ​\(x;θR,𝒦\)\),C\(x;\\theta\_\{R\},\\mathcal\{K\}\)=\\mathrm\{Concat\}\\big\(x,\\rho\(x;\\theta\_\{R\},\\mathcal\{K\}\)\\big\),\(2\)where

ρ​\(x;θR,𝒦\)=\{∅,θR​\(x\)∈\{ES\}×𝒟,𝒦θR​\(x\)∪𝒦univ,θR​\(x\)∈\{MS,HS\}×𝒟\.\\rho\(x;\\theta\_\{R\},\\mathcal\{K\}\)=\\begin\{cases\}\\emptyset,&\\theta\_\{R\}\(x\)\\in\\\{\\mathrm\{ES\}\\\}\\times\\mathcal\{D\},\\\\ \\mathcal\{K\}\_\{\\theta\_\{R\}\(x\)\}\\cup\\mathcal\{K\}\_\{\\mathrm\{univ\}\},&\\theta\_\{R\}\(x\)\\in\\\{\\mathrm\{MS\},\\mathrm\{HS\}\\\}\\times\\mathcal\{D\}\.\\end\{cases\}\(3\)The projectionθR\\theta\_\{R\}mapsxxto a block of the tree using two signals: the empirical difficulty tier emitted by tiered self\-consistency and a coarse epistemic\-domain tag\. This formulation separates*storage*from*injection*:𝒦\\mathcal\{K\}can contain many cards, but each solve reads only the block selected byθR\\theta\_\{R\}plus universal cards, and ES reads no learned cards\. Figure[1](https://arxiv.org/html/2606.00532#S2.F1)shows this path from problem, to tier/domain projection, to a small active context\.

![Refer to caption](https://arxiv.org/html/2606.00532v1/x1.png)Figure 1:KACEarchitecture\. A frozen base LLM serves as the solver\. At test time, tiered self\-consistency supplies an empirical difficulty tier and the domain classifier supplies an epistemic\-domain tag; together they project the epistemic tree𝒦\\mathcal\{K\}to a single \(difficulty, domain\) block\. ES reads no learned cards, while MS and HS read only the cards under the selected block\. The pipeline is locked: ES \(22attempts,2/22/2unanimous gate\)→\\toMS \(33attempts,2/32/3majority gate\)→\\toHS \(55attempts,2/52/5plurality gate\), with a pooled\-of\-1010plurality fallback\.#### Relation to prior context methods\.

ACE, GEPA, and flat retrieval can each be viewed as restrictions of this context function\. ACE collapses both axes into one growing prompt; GEPA further compresses the learned object into a single natural\-language artifact; flat retrieval keeps a flat𝒦\\mathcal\{K\}and replaces the hard projection with top\-kksimilarity search\.KACEpreserves the difficulty and domain axes and makes the test\-time projection explicit\.

### 2\.2Tiered Self\-Consistency and Tree Reading

The projectionθR\\theta\_\{R\}is supplied by a locked tiered self\-consistency pipeline\. A coarse classifier first assigns a domain tagdd\. The solver then runs three escalating tiers: ES uses two attempts and exits on2/22/2agreement, MS uses three attempts and exits on2/32/3agreement, and HS uses five attempts and exits on a2/52/5plurality\. If no tier exits, the system pools all1010attempts and returns the pooled plurality when available\. The tier at which the gate fires is the empirical difficulty signal used byθR\\theta\_\{R\}\.

At each learned\-card tier, the solver reads only the matching node of the tree\. ES is a card\-free gate: it reads the problem only and exits only on2/22/2agreement\. MS reads𝒦\(MS,d\)\\mathcal\{K\}\_\{\(\\mathrm\{MS\},d\)\}and HS reads𝒦\(HS,d\)\\mathcal\{K\}\_\{\(\\mathrm\{HS\},d\)\}, each together with universal cards\. The inference shape is fixed across experiments; only the learned tree𝒦\\mathcal\{K\}varies\.

Table 1:Locked tiered self\-consistency\. The shape, per\-tier exit gates, and fallback chain are invariant under all experiments\. The card node read at each learned\-card tier is determined by strict per\-tier difficulty visibility \(ES→\\tono cards, MS→\\to\{medium, universal\}, HS→\\to\{hard, universal\}\) intersected with the active domain tag\.
### 2\.3Offline Construction of the Epistemic Tree

The offline learner constructs𝒦\\mathcal\{K\}from a training splitT=\{\(xi,yi\)\}T=\\\{\(x\_\{i\},y\_\{i\}\)\\\}\. Conceptually, it seeks to reduce the empirical answer loss

ℒ​\(𝒦\)=1\|T\|​∑\(xi,yi\)∈Tℓ​\(yi,πS​\(C​\(xi;θR,𝒦\)\)\),\\mathcal\{L\}\(\\mathcal\{K\}\)=\\frac\{1\}\{\|T\|\}\\sum\_\{\(x\_\{i\},y\_\{i\}\)\\in T\}\\ell\\\!\\left\(y\_\{i\},\\pi\_\{S\}\(C\(x\_\{i\};\\theta\_\{R\},\\mathcal\{K\}\)\)\\right\),\(4\)whereπS\\pi\_\{S\}is the frozen solver andℓ\\ellis answer\-level error\. SinceπS\\pi\_\{S\}, self\-consistency voting, and card selection are black\-box and non\-smooth, we do not compute gradients\. Instead, each failed trace supplies a natural\-language residual: a local description of the knowledge missing under the block that was actually read\.

#### Phase 1 — ADD as Block\-local context learning from failed traces\.

At epochnn, we run the tiered solver with the current tree𝒦\(n\)\\mathcal\{K\}^\{\(n\)\}\. This induces a block assignment

bi\(n\)=θR​\(xi;𝒦\(n\)\)b\_\{i\}^\{\(n\)\}=\\theta\_\{R\}\(x\_\{i\};\\mathcal\{K\}^\{\(n\)\}\)\(5\)and, for each blockbb, a residual set

Rb\(n\)=\{τi\(n\):bi\(n\)=b,πS​\(C​\(xi;θR,𝒦\(n\)\)\)≠yi\}\.R\_\{b\}^\{\(n\)\}=\\\{\\,\\tau\_\{i\}^\{\(n\)\}:b\_\{i\}^\{\(n\)\}=b,\\;\\pi\_\{S\}\(C\(x\_\{i\};\\theta\_\{R\},\\mathcal\{K\}^\{\(n\)\}\)\)\\neq y\_\{i\}\\,\\\}\.\(6\)The ADD phase updates each block using only its own residuals:

𝒦~b\(n\+1\)=𝒦b\(n\)∪Ab​\(Rb\(n\),𝒦b\(n\)\),\\widetilde\{\\mathcal\{K\}\}\_\{b\}^\{\(n\+1\)\}=\\mathcal\{K\}\_\{b\}^\{\(n\)\}\\cup A\_\{b\}\(R\_\{b\}^\{\(n\)\},\\mathcal\{K\}\_\{b\}^\{\(n\)\}\),\(7\)whereAbA\_\{b\}is a teacher\-reflection operator that proposes compact cards for the missing lemmas, invariants, reductions, or verification checks observed inRb\(n\)R\_\{b\}^\{\(n\)\}\. This is a block\-local analog of textual\-gradient and reflective prompt\-update methods\(Pryzantet al\.,[2023](https://arxiv.org/html/2606.00532#bib.bib18); Agrawalet al\.,[2025](https://arxiv.org/html/2606.00532#bib.bib15); Zhanget al\.,[2025b](https://arxiv.org/html/2606.00532#bib.bib1)\)\. The reasonKACEcan add many cards in one phase is that it is not making one large edit to a single prompt; it is making local edits to conditionally activated blocks of𝒦\\mathcal\{K\}\. This follows the structural intuition of block\-coordinate and block\-successive minimization: optimize pieces of a large object using local surrogate information while the global decomposition is held fixed\(Razaviyaynet al\.,[2013](https://arxiv.org/html/2606.00532#bib.bib52)\)\.

#### Phase 2 — REFINE under the induced post\-update distribution\.

The provisional tree𝒦~\(n\+1\)\\widetilde\{\\mathcal\{K\}\}^\{\(n\+1\)\}changes the context seen by the solver, and therefore changes which tiers fire, which cards are read, and which residual failures remain\. A card that looked useful under𝒦\(n\)\\mathcal\{K\}^\{\(n\)\}may become redundant, misplaced, or harmful under the new tree\. We therefore run a second pass with𝒦~\(n\+1\)\\widetilde\{\\mathcal\{K\}\}^\{\(n\+1\)\}and collect traces

τ~i\(n\+1\)∼πS​\(C​\(xi;θR,𝒦~\(n\+1\)\)\)\.\\widetilde\{\\tau\}\_\{i\}^\{\(n\+1\)\}\\sim\\pi\_\{S\}\(C\(x\_\{i\};\\theta\_\{R\},\\widetilde\{\\mathcal\{K\}\}^\{\(n\+1\)\}\)\)\.\(8\)These traces define post\-update residual sets\{R~b\(n\+1\)\}b∈ℬ\\\{\\widetilde\{R\}\_\{b\}^\{\(n\+1\)\}\\\}\_\{b\\in\\mathcal\{B\}\}\. The curator then applies

𝒦\(n\+1\)=Q​\(𝒦~\(n\+1\),\{R~b\(n\+1\)\}b∈ℬ\),\\mathcal\{K\}^\{\(n\+1\)\}=Q\\\!\\left\(\\widetilde\{\\mathcal\{K\}\}^\{\(n\+1\)\},\\\{\\widetilde\{R\}\_\{b\}^\{\(n\+1\)\}\\\}\_\{b\\in\\mathcal\{B\}\}\\right\),\(9\)whereQQmay keep, edit, relocate, or deprecate existing cards, but does not introduce new ones\. This second pass addresses the same induced\-distribution issue formalized by DAgger\-style dataset aggregation\(Rosset al\.,[2011](https://arxiv.org/html/2606.00532#bib.bib53)\): corrections should be evaluated under the behavior distribution induced by the current policy, not only under traces collected before the update\. InKACE, the relevant policy is the frozen solver wrapped by the current context function\. REFINE therefore evaluates cards in the context in which they will actually be used\. After a small fixed number of epochs, the entire tree is frozen before validation or test evaluation\. The full pseudocode of the offline learner — including the per\-cell sweep order, the paired leave\-one\-out lift used byImpact, and the boundedRefinebudgetM∈\{1,2\}M\\\!\\in\\\!\\\{1,2\\\}— is given as Algorithm[1](https://arxiv.org/html/2606.00532#alg1)in Appendix[B](https://arxiv.org/html/2606.00532#A2)\.

#### Pointers to appendix material\.

The full card schema, strict per\-tier visibility table, and JSON representation are given in Appendix[C](https://arxiv.org/html/2606.00532#A3); two worked example cards from the frozen AIME tree are reproduced in Appendix[D](https://arxiv.org/html/2606.00532#A4); verbatim prompts forBuildEpistemicTree,FindKnowledge, the curator, and the per\-tier solvers are listed in Appendix[E](https://arxiv.org/html/2606.00532#A5); per\-tier temperatures, output\-token caps, and seed\-averaging protocol are in Appendix[F](https://arxiv.org/html/2606.00532#A6)\.

## 3Experiments and Results

### 3\.1Datasets and Setup

We evaluateKACEon three challenging mathematical reasoning benchmarks\. We define the empirical difficulty of each problem as the fraction of independent sampled solver attempts that produce the correct answer; the formal definition and the concordance of this signal with our tier\-exit ordering are reported in §[3\.4](https://arxiv.org/html/2606.00532#S3.SS4)\. The offline construction loop reads only training splits; validation splits are used for hyper\-parameter selection but never feed back into𝒦\\mathcal\{K\}\. All test\-set numbers are computed after the tree is frozen and are averaged over33independent solver\-sampling seeds\. The frozen tree is identical across seeds\.

AIME 2025\.The 2025 American Invitational Mathematics Examination \(AIME\), administered by the Mathematical Association of America, comprises short\-answer competition problems whose solutions are integers in\[0,999\]\[0,999\]\. We use the official 30\-problem AIME 2025 contest as the held\-out test set and use 90 problems drawn from prior\-year AIME contests for training and validation, partitioned 45/45; no AIME 2025 test problem appears in training or validation\.MATH\-HARD \(L4–L5\)\.The Hendrycks MATH dataset\(Hendryckset al\.,[2021](https://arxiv.org/html/2606.00532#bib.bib43)\)contains 12,500 competition problems annotated with difficulty levels\. We restrict to Level\-4 and Level\-5 problems sampled across the seven topical domains and partition them into 50 train, 50 validation, and 50 test problems\. This subset isolates the regime where the base solver has non\-trivial error rate\.OlymMATH\-EN \(verifiable subset\)\.OlymMATH\(Sunet al\.,[2025](https://arxiv.org/html/2606.00532#bib.bib44)\)contains 200 Olympiad\-level mathematical reasoning problems with bilingual statements\. We restrict to the English subset and to problems whose final answer is a single closed\-form expression that can be checked automatically; this yields 97 problems, partitioned 30/40/27 into train/validation/test\.

#### Solvers and inference protocol\.

AIME 2025 and OlymMATH\-EN usegpt\-4\.1\-mini; MATH\-HARD usesgpt\-4o\-mini\. The locked tiered self\-consistency shape is invariant across experiments\. The offline construction budget is two epochs for AIME 2025 and OlymMATH\-EN, and four epochs for MATH\-HARD; the epoch count is selected on the validation split\. Full per\-tier temperatures, output\-token caps, domain\-classifier settings, DIPPER\(Lauet al\.,[2024](https://arxiv.org/html/2606.00532#bib.bib32)\)perturbations, seed\-averaging protocol, and card\-format details are reported in Appendix[F](https://arxiv.org/html/2606.00532#A6)\. The size of the frozen tree per dataset \(active card count, total tokens, and average tokens per card\) is reported in Appendix Table[9](https://arxiv.org/html/2606.00532#A6.T9); the AIME tree contains132132active cards totaling∼\\sim33,600 tokens, of which only the cards under the active \(difficulty, domain\) node \(∼\\sim1000–5000 tokens\) reach the solver per problem\.

### 3\.2Why Flat Learned Context Is Insufficient

During accumulation of monolithic context, ACE\(Zhanget al\.,[2025b](https://arxiv.org/html/2606.00532#bib.bib1)\)degrades on hard reasoning\. Manually partitioning the playbook by domain mitigates this degradation and brings accuracy in line with GEPA\. Table[2](https://arxiv.org/html/2606.00532#S3.T2)reports a single\-call comparison on AIME 2025—one solver attempt per problem—and motivates the storage\-vs\-usage demarcation used byKACE\.

Table 2:Diagnostic single\-call comparison on AIME 2025\. ACE accumulation degrades the base solver; domain partitioning recovers accuracy; GEPA resists bloat through compact prompt optimization but remains bounded by one active prompt\.A single ACE playbook accumulated in two or more epochs underperforms even the unaugmented baseline \(38\.5%38\.5\\%vs\.42\.5%42\.5\\%\)\. Inspecting the failures, we observed the standard symptom of context bloat: irrelevant guidance crowds out load\-bearing entries, and the solver is biased by surface\-level similarity between the prompt and unrelated playbook items\. GEPA\-style compact\-prompt compression\(Agrawalet al\.,[2025](https://arxiv.org/html/2606.00532#bib.bib15)\)sidesteps this failure \(51%51\\%\), but the durable knowledge body is still bounded by a single prompt\.

Partitioning the playbook by problem domain recovers and improves on the baseline after 2 epochs \(52\.2%52\.2\\%\), but beyond that, the domain\-partitioned playbooks themselves bloat: within a single domain, hard\-problem rubrics begin to dilute easy\-problem identities, and accuracy returns toward the ACE trajectory\. We read this as evidence that domain partitioning addresses topic relevance but not difficulty relevance\.

### 3\.3Tiered Self\-Consistency vs\. Fixed Best\-of\-NN

Before adding any learned knowledge, we evaluate the tiered shape itself against fixed Best\-of\-55self\-consistency\(Wanget al\.,[2022](https://arxiv.org/html/2606.00532#bib.bib6)\)\. This tests whether the tier signal used byKACEis useful even before any cards are read; the broader motivation that compute should be allocated by problem difficulty rather than uniformly is established by Adaptive\-Consistency\(Aggarwalet al\.,[2023](https://arxiv.org/html/2606.00532#bib.bib50)\)and compute\-optimal scaling\(Snellet al\.,[2024](https://arxiv.org/html/2606.00532#bib.bib39)\)\.

Table 3:Tiered self\-consistency at comparable compute to fixed Best\-of\-55\. Average solver calls per problem are reported per dataset and computed by mixing the per\-tier exit\-gate fractions \(ES exits at 2, MS at 5, HS at 10, fallback at 10\)\. On all three benchmarks tiered Best\-of\-NNmatches or exceeds Best\-of\-55, giving us a difficulty signal without training a separate classifier\.Across all three benchmarks, tiered Best\-of\-NNmatches or exceeds Best\-of\-55\(Table[3](https://arxiv.org/html/2606.00532#S3.T3)\)\. Per\-method per\-dataset call counts \(including the best\-snapshotKACErun\) are tabulated in Appendix Table[8](https://arxiv.org/html/2606.00532#A6.T8)\. The fullKACEsystem inherits this headroom: learned context is added on top of a compute\-adaptive inference shape rather than on top of a uniform sampler\.

### 3\.4Tiered Self\-Consistency as a Difficulty Classifier

The same exit path that supplies tiered Best\-of\-NNalso acts as the empirical difficulty signal consumed by the projectionθR\\theta\_\{R\}\. We test whether the ordinal ES→\\toMS→\\toHS→\\tofallback ranking preserves the model’s own per\-problem solve rate, defined as the fraction of independent samples that answer the problem correctly acrossS=3S\{=\}3seeds \(no human difficulty labels enter this comparison\)\.

Table 4:Tier\-exit ordering vs\. empirical solve\-rate ordering\. Pairwise concordance is the fraction of problem pairs whose tier ranking matches their solve\-rate ranking;50%50\\%is the random baseline\. Per\-bucket breakdowns and the dominant failure mode \(VERY\-HARD problems on which all attempts agree on a wrong answer\) are deferred to Appendix[G](https://arxiv.org/html/2606.00532#A7)\.The tier\-exit predictor preserves78%78\\%of AIME 2025 problem pairs and80%80\\%of MATH\-HARD pairs, well above the50%50\\%random baseline\. This justifies using the tier signal directly as the difficulty coordinate ofθR\\theta\_\{R\}, without training a separate classifier\.

### 3\.5Adding Learned Context: ACE, GEPA, andKACE

We now add learned context to the tiered pipeline and compare flat learned\-context baselines against the difficulty\- and domain\-stratified epistemic tree\. The final pipeline keeps Easy Solvers \(ES\) card\-free\. In pilot runs, allowing learned cards in ES increased the number of ES exits, but it also increased overconfident wrong exits; therefore, we keep ES as a problem\-only2/22/2unanimity gate and inject learned cards only after escalation to MS or HS\.

Table 5:Main comparison across three benchmarks\. Tiered rows use the locked tiered pipeline;*Tiered\+\+ACE*adds a one\-epoch compact lesson playbook;*Tiered\+\+GEPA*adds a two\-epoch GEPA\-optimized prompt; fullKACEuses the frozen epistemic tree\.Adding ACE\-style pithy lessons to the tiered shape improves AIME 2025 over tiering alone \(54\.5%54\.5\\%vs\.52\.2%52\.2\\%\), but regresses on MATH\-HARD and OlymMATH\-EN\. This is the same context\-bloat sensitivity observed in the diagnostic setting: tier placement helps, but without a domain\-stratified tree the active context still accumulates irrelevant guidance\.

Tiered\+\+GEPA is the strongest learned\-context baseline on AIME 2025 \(56\.6%56\.6\\%\)\. Because GEPA\(Agrawalet al\.,[2025](https://arxiv.org/html/2606.00532#bib.bib15)\)and ACE\(Zhanget al\.,[2025b](https://arxiv.org/html/2606.00532#bib.bib1)\)both optimize a single prompt artifact, we apply that prompt at the hard solver only, where it returns the maximum benefit; both baselines used Sonnet 4\.6 as the teacher reflector and were run at compute budgets comparable toKACE\. GEPA’s compact prompt avoids the worst ACE accumulation failure, but the durable knowledge body remains capped by a single prompt\.

FullKACEreaches62\.2%62\.2\\%on AIME 2025, a10\.410\.4\-point absolute gain over Best\-of\-55and a5\.65\.6\-point gain over Tiered\+\+GEPA,77\.3%77\.3\\%on MATH\-HARD, and48\.3%48\.3\\%on OlymMATH\-EN\. All headline numbers are 3\-seed averages over the frozen tree\. The pattern is consistent with the central claim: the gains come from combining compute\-adaptive tiering with a knowledge store whose capacity is larger than, and conditionally separated from, the active prompt\.

## 4Ablations

We run AIME25\-only ablations to isolate three questions: whether the difficulty/domain axes matter, whether teacher\-distilled cards matter, and whether generic retrieval can replace the fixed tree projection\. Unless stated otherwise, variants use the same solver, locked tiered schedule, and learned card pool as fullKACE\. The learned card pool contained 132 cards across all the ablation experiments\.

Table 6:AIME25 ablations\. Rows isolate the structural axes of the tree, card quality, and the projection mechanism while keeping the locked tiered solver fixed\.#### Domains are important\.

Removing the domain axis drops AIME25 accuracy from62\.2%62\.2\\%to54\.4%54\.4\\%, even though the card pool is unchanged\. The flattening of both axes into one monolithic context drops further to51\.6%51\.6\\%\. This supports the storage\-vs\-usage claim: learned cards help only when the active prompt receives the relevant slice rather than the whole store\.

#### Card quality matters\.

Replacing teacher\-distilled cards with raw solver reflections drops accuracy to59\.3%59\.3\\%\. Reflection alone produces useful signal, but the compactness and tier coherence of the distilled cards matter once each card is assigned to a specific \(difficulty, domain\) node\.

#### Generic retrieval does not replace the tree projection\.

Replacing the fixed difficulty\-domain projection with an LLM router reaches56\.4%56\.4\\%, and replacing it with an embedding retriever reaches56\.2%56\.2\\%\. Their near\-identical performance suggests that the deficit is not specific to one routing implementation; on AIME25, the structured projection supplied by tiered self\-consistency and domain stratification is more reliable than generic retrieval over the same card pool\.

## 5Discussion

The results support a simple view of context engineering for hard mathematical reasoning: the bottleneck is not only how much useful knowledge the system has stored, but whether it can expose the right part of that knowledge at the right time\. Monolithic playbooks and flattened card stores contain useful information, but they also increase distractor mass in the active prompt\.KACEimproves by separating the durable knowledge body from the small context slice read by the solver\.

The ablations provide evidence that explicit epistemic and difficulty boundaries are more effective here than generic routing over the same learned store\. Replacing the fixed difficulty\-domain projection with an LLM router reaches56\.4%56\.4\\%on AIME25, and replacing it with an embedding retriever reaches56\.2%56\.2\\%, both below fullKACEat62\.2%62\.2\\%\. On the AIME25 ablation, this points to the useful retrieval unit being not just the nearest card or the card selected by a general router, but the branch defined by the problem’s inferred domain and empirical difficulty tier\. However, we believe that for future work, scaling of KB size will require KACE like epistemic and difficulty demarcation, but also embedding retrievers\. The same tier signal that selects the difficulty branch is itself calibrated against per\-problem solve rate \(§[3\.4](https://arxiv.org/html/2606.00532#S3.SS4)\), so the projectionθR\\theta\_\{R\}inherits a difficulty coordinate that does not require an external classifier — the cost of which would be additional supervision on a small held\-out split\.

## 6Limitations

We list four limitations that condition the claims above\.

1. 1\.Search\-space limitation\.KACEand other context\-based techniques work when errors repeat or when the reflection engine can generalize errors across a repeatable space\. However, this does not replace the true marker of progress: exploration done by RL or pre\-training\. A structured tree of cards can condense lessons, but search compute cost still needs to be spent\. Our work does not operationalize the search mechanism needed to scale LLM reasoning during context\-based learning; this is left for future work\. Two concrete attempts at this scaling are recorded as appendix\-only negative results: a context\-folding “synthesis” step over disagreeing HS attempts \(Appendix[H](https://arxiv.org/html/2606.00532#A8)\) produced essentially no accuracy change because no step\-level verifier could localize where the disagreement actually went wrong, and an “imagination” loop that expanded the AIME tree from 132 to 450 cards \(Appendix[I](https://arxiv.org/html/2606.00532#A9)\) raised validation accuracy while test accuracy plateaued and then drifted down — the canonical signature of overfitting once the validation residual is exhausted\.
2. 2\.Cross\-benchmark transfer is not automatic\.Cards distilled from AIME\-style failure traces do not help on OlymMATH\-EN, where the failure modes differ\.
3. 3\.Our method can scale tree size but we do not report such results\.KACEcan in principle grow the tree along both axes, but we do not utilize the full capacity because more epochs did not improve validation results\. This was due to stale insights from already\-learned trajectories rather than tree\-size bottlenecks\.
4. 4\.Single\-provider solver stack\.All experiments use OpenAI models for the solver, router, and coarse domain classifier\. Whether the tree transfers when the solver and the readers of the tree come from unrelated model families — for example a stronger solver consulting a tree built with a weaker teacher — is not yet tested\.

## 7Broader Impacts

KACEadvances context engineering on the hardest end of mathematical reasoning, where LLM systems are most likely to fail and where small accuracy gains can matter in settings where difficult reasoning bottlenecks downstream workflows\. The positive implications extend beyond a single benchmark: durable, inspectable, structured knowledge stores layered around a frozen base model are a building block for more intelligent systems built around LLMs — mathematical tutoring, verification of student work, automated checking of scientific calculation, and reasoning\-heavy assistant workflows in adjacent domains such as programming and formal verification, all of which depend on consulting the right knowledge under uncertainty rather than on scaling parametric knowledge alone\. Reliability is the second axis along which we expect impact\. BecauseKACEstores its learned knowledge as natural\-language cards organized by \(difficulty, domain\) node, what the system has “learned” is auditable in a way that weight\-updated alternatives are not; this inspectability is a small but real alignment\-favorable property as such systems are deployed in higher\-stakes contexts\. We do not foresee a direct negative\-use path beyond the standard caution that any improvement in LLM math accuracy can encourage over\-reliance on LLM outputs in high\-stakes evaluation, and the confident\-wrong failure mode characterized in Appendix[G](https://arxiv.org/html/2606.00532#A7)remains a real risk that any deployment must address\.

## 8Conclusion

We introducedKACE, a framework that treats context engineering as the offline construction of an epistemic tree of typed knowledge cards, consumed at test\-time through a locked tiered self\-consistency pipeline\. By organizing the durable knowledge body along the two axes that govern relevance — difficulty and epistemic domain — and by using disagreement\-driven tier escalation to read only the matching learned branch after ES,KACEoffers a structured alternative to monolithic playbooks and fixed Best\-of\-NNsampling\. The clean demarcation between offline epistemic structuring and online difficulty\-aware consumption substantially improves hard mathematical reasoning in our experiments, and the methodology provides a broader template for studying context as a structured, conditionally\-read resource\.

## References

- Let’s sample step by step: adaptive\-consistency for efficient reasoning and coding with llms\.InEMNLP,Cited by:[§A\.5](https://arxiv.org/html/2606.00532#A1.SS5.p1.1),[§1](https://arxiv.org/html/2606.00532#S1.p2.11),[§3\.3](https://arxiv.org/html/2606.00532#S3.SS3.p1.1)\.
- L\. A\. Agrawal, S\. Tan, D\. Soylu, N\. Ziems, R\. Khare, K\. Opsahl\-Ong, A\. Singhvi, H\. Shandilya, M\. J\. Ryan, M\. Jiang, C\. Potts, K\. Sen, A\. G\. Dimakis, I\. Stoica, D\. Klein, M\. Zaharia, and O\. Khattab \(2025\)GEPA: reflective prompt evolution can outperform reinforcement learning\.arXiv preprint arXiv:2507\.19457\.Cited by:[§A\.3](https://arxiv.org/html/2606.00532#A1.SS3.p1.1),[§1](https://arxiv.org/html/2606.00532#S1.p1.1),[§2\.3](https://arxiv.org/html/2606.00532#S2.SS3.SSS0.Px1.p1.6),[§3\.2](https://arxiv.org/html/2606.00532#S3.SS2.p2.3),[§3\.5](https://arxiv.org/html/2606.00532#S3.SS5.p3.2)\.
- A\. Asai, Z\. Wu, Y\. Wang, A\. Sil, and H\. Hajishirzi \(2024\)Self\-rag: learning to retrieve, generate, and critique through self\-reflection\.InICLR,Cited by:[§A\.4](https://arxiv.org/html/2606.00532#A1.SS4.p1.1)\.
- M\. Besta, N\. Blach, A\. Kubicek, R\. Gerstenberger, M\. Podstawski, L\. Gianinazzi, J\. Gajda, T\. Lehmann, H\. Niewiadomski, P\. Nyczyk, and T\. Hoefler \(2024\)Graph of thoughts: solving elaborate problems with large language models\.InAAAI,Cited by:[§A\.5](https://arxiv.org/html/2606.00532#A1.SS5.p1.1)\.
- R\. Chang, D\. Kwon, J\. Lee, and N\. Verma \(2026\)CascadeDebate: multi\-agent deliberation for cost\-aware llm cascades\.arXiv preprint arXiv:2604\.12262\.Cited by:[§A\.5](https://arxiv.org/html/2606.00532#A1.SS5.p1.1),[§1](https://arxiv.org/html/2606.00532#S1.p2.11)\.
- P\. Chhikara, D\. Khant, S\. Aryan, T\. Singh, and D\. Yadav \(2025\)Mem0: building production\-ready ai agents with scalable long\-term memory\.arXiv preprint arXiv:2504\.19413\.Cited by:[§A\.2](https://arxiv.org/html/2606.00532#A1.SS2.p1.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\)Training verifiers to solve math word problems\.arXiv preprint arXiv:2110\.14168\.Cited by:[§A\.5](https://arxiv.org/html/2606.00532#A1.SS5.p1.1)\.
- C\. Fernando, D\. Banarse, H\. Michalewski, S\. Osindero, and T\. Rocktäschel \(2023\)Promptbreeder: self\-referential self\-improvement via prompt evolution\.arXiv preprint arXiv:2309\.16797\.Cited by:[§A\.3](https://arxiv.org/html/2606.00532#A1.SS3.p1.1)\.
- B\. Gao, F\. Song, Z\. Yang, Z\. Cai, Y\. Miao, Q\. Dong, L\. Li, C\. Ma, L\. Chen, R\. Xu, Z\. Tang, B\. Wang, D\. Zan, S\. Quan, G\. Zhang, L\. Sha, Y\. Zhang, X\. Ren, T\. Liu, and B\. Chang \(2024\)Omni\-math: a universal olympiad level mathematic benchmark for large language models\.arXiv preprint arXiv:2410\.07985\.Cited by:[§A\.9](https://arxiv.org/html/2606.00532#A1.SS9.p1.1)\.
- C\. He, R\. Luo, Y\. Bai, S\. Hu, Z\. L\. Thai, J\. Shen, J\. Hu, X\. Han, Y\. Huang, Y\. Zhang, J\. Liu, L\. Qi, Z\. Liu, and M\. Sun \(2024\)OlympiadBench: a challenging benchmark for promoting agi with olympiad\-level bilingual multimodal scientific problems\.arXiv preprint arXiv:2402\.14008\.Cited by:[§A\.9](https://arxiv.org/html/2606.00532#A1.SS9.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt \(2021\)Measuring mathematical problem solving with the math dataset\.arXiv preprint arXiv:2103\.03874\.Cited by:[§A\.9](https://arxiv.org/html/2606.00532#A1.SS9.p1.1),[§3\.1](https://arxiv.org/html/2606.00532#S3.SS1.p2.1)\.
- S\. Hu, C\. Lu, and J\. Clune \(2024\)Automated design of agentic systems\.arXiv preprint arXiv:2408\.08435\.Cited by:[§A\.7](https://arxiv.org/html/2606.00532#A1.SS7.p1.1)\.
- S\. Jeong, J\. Baek, S\. Cho, S\. J\. Hwang, and J\. C\. Park \(2024\)Adaptive\-rag: learning to adapt retrieval\-augmented large language models through question complexity\.InNAACL,Cited by:[§A\.4](https://arxiv.org/html/2606.00532#A1.SS4.p1.1)\.
- O\. Khattab, A\. Singhvi, P\. Maheshwari, Z\. Zhang, K\. Santhanam, S\. Vardhamanan, S\. Haq, A\. Sharma, T\. T\. Joshi, H\. Moazam, H\. Miller, M\. Zaharia, and C\. Potts \(2024\)DSPy: compiling declarative language model calls into self\-improving pipelines\.InICLR,Cited by:[§A\.3](https://arxiv.org/html/2606.00532#A1.SS3.p1.1)\.
- Y\. Kim, C\. Park, H\. Jeong, Y\. S\. Chan, X\. Xu, D\. McDuff, H\. Lee, M\. Ghassemi, C\. Breazeal, and H\. W\. Park \(2024\)MDAgents: an adaptive collaboration of llms for medical decision\-making\.InNeurIPS,Cited by:[§A\.5](https://arxiv.org/html/2606.00532#A1.SS5.p1.1),[§1](https://arxiv.org/html/2606.00532#S1.p2.11)\.
- G\. K\. R\. Lau, W\. Hu, D\. Liu, J\. Chen, S\. Ng, and B\. K\. H\. Low \(2024\)Dipper: diversity in prompts for producing large language model ensembles in reasoning tasks\.arXiv preprint arXiv:2412\.15238\.Cited by:[§A\.5](https://arxiv.org/html/2606.00532#A1.SS5.p1.1),[§3\.1](https://arxiv.org/html/2606.00532#S3.SS1.SSS0.Px1.p1.3)\.
- H\. Lee, L\. Soldaini, A\. Cohan, M\. Seo, and K\. Lo \(2024\)RouterRetriever: routing over a mixture of expert embedding models\.arXiv preprint arXiv:2409\.02685\.Cited by:[§A\.4](https://arxiv.org/html/2606.00532#A1.SS4.p1.1)\.
- Y\. Li, Z\. Lin, S\. Zhang, Q\. Fu, B\. Chen, J\. Lou, and W\. Chen \(2023\)Making language models better reasoners with step\-aware verifier\.InACL,Cited by:[§A\.5](https://arxiv.org/html/2606.00532#A1.SS5.p1.1)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2023\)Let’s verify step by step\.arXiv preprint arXiv:2305\.20050\.Cited by:[§A\.5](https://arxiv.org/html/2606.00532#A1.SS5.p1.1)\.
- N\. F\. Liu, K\. Lin, J\. Hewitt, A\. Paranjape, M\. Bevilacqua, F\. Petroni, and P\. Liang \(2024\)Lost in the middle: how language models use long contexts\.Transactions of the Association for Computational Linguistics\.Cited by:[§A\.2](https://arxiv.org/html/2606.00532#A1.SS2.p1.1)\.
- A\. Madaan, N\. Tandon, P\. Gupta, S\. Hallinan, L\. Gao, S\. Wiegreffe, U\. Alon, N\. Dziri, S\. Prabhumoye, Y\. Yang, S\. Gupta, B\. P\. Majumder, K\. Hermann, S\. Welleck, A\. Yazdanbakhsh, and P\. Clark \(2023\)Self\-refine: iterative refinement with self\-feedback\.arXiv preprint arXiv:2303\.17651\.Cited by:[§A\.2](https://arxiv.org/html/2606.00532#A1.SS2.p1.1)\.
- I\. Ong, A\. Almahairi, V\. Wu, W\. Chiang, T\. Wu, J\. E\. Gonzalez, M\. W\. Kadous, and I\. Stoica \(2024\)RouteLLM: learning to route llms with preference data\.arXiv preprint arXiv:2406\.18665\.Cited by:[§A\.4](https://arxiv.org/html/2606.00532#A1.SS4.p1.1),[§A\.8](https://arxiv.org/html/2606.00532#A1.SS8.p1.1)\.
- K\. Opsahl\-Ong, M\. J\. Ryan, J\. Purtell, D\. Broman, C\. Potts, M\. Zaharia, and O\. Khattab \(2024\)Optimizing instructions and demonstrations for multi\-stage language model programs\.InEMNLP,Cited by:[§A\.3](https://arxiv.org/html/2606.00532#A1.SS3.p1.1)\.
- C\. Packer, S\. Wooders, K\. Lin, V\. Fang, S\. G\. Patil, I\. Stoica, and J\. E\. Gonzalez \(2023\)MemGPT: towards llms as operating systems\.arXiv preprint arXiv:2310\.08560\.Cited by:[§A\.2](https://arxiv.org/html/2606.00532#A1.SS2.p1.1)\.
- R\. Pryzant, D\. Iter, J\. Li, Y\. T\. Lee, C\. Zhu, and M\. Zeng \(2023\)Automatic prompt optimization with “gradient descent” and beam search\.InEMNLP,Cited by:[§A\.3](https://arxiv.org/html/2606.00532#A1.SS3.p1.1),[§2\.3](https://arxiv.org/html/2606.00532#S2.SS3.SSS0.Px1.p1.6)\.
- M\. Razaviyayn, M\. Hong, and Z\. Luo \(2013\)A unified convergence analysis of block successive minimization methods for nonsmooth optimization\.SIAM Journal on Optimization23\(2\),pp\. 1126–1153\.External Links:[Document](https://dx.doi.org/10.1137/120891009)Cited by:[§2\.3](https://arxiv.org/html/2606.00532#S2.SS3.SSS0.Px1.p1.6)\.
- S\. Ross, G\. J\. Gordon, and J\. A\. Bagnell \(2011\)A reduction of imitation learning and structured prediction to no\-regret online learning\.InProceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics,Proceedings of Machine Learning Research, Vol\.15,pp\. 627–635\.Cited by:[§2\.3](https://arxiv.org/html/2606.00532#S2.SS3.SSS0.Px2.p1.6)\.
- T\. Schnabel and J\. Neville \(2024\)Symbolic prompt program search: a structure\-aware approach to efficient compile\-time prompt optimization\.InFindings of EMNLP,Cited by:[§A\.3](https://arxiv.org/html/2606.00532#A1.SS3.p1.1)\.
- N\. Shinn, F\. Cassano, E\. Berman, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\)Reflexion: language agents with verbal reinforcement learning\.arXiv preprint arXiv:2303\.11366\.Cited by:[§A\.2](https://arxiv.org/html/2606.00532#A1.SS2.p1.1)\.
- A\. Singh, J\. D\. Co\-Reyes, R\. Agarwal, A\. Anand, P\. Patil, X\. Garcia, P\. J\. Liu, J\. Harrison, J\. Lee, K\. Xu, A\. Parisi, A\. Kumar, A\. Alemi, A\. Rizkowsky, A\. Nova, B\. Adlam, B\. Bohnet, G\. F\. Elsayed, H\. Sedghi, I\. Mordatch, I\. Simpson, I\. Gur, J\. Snoek, J\. Pennington, J\. Hron, K\. Kenealy, K\. Swersky, K\. Mahajan, L\. Culp, L\. Xiao, M\. L\. Bileschi, N\. Constant, R\. Novak, R\. Liu, T\. Warkentin, Y\. Qian, Y\. Bansal, E\. Dyer, B\. Neyshabur, J\. Sohl\-Dickstein, and N\. Fiedel \(2023\)Beyond human data: scaling self\-training for problem\-solving with language models \(rest\-em\)\.arXiv preprint arXiv:2312\.06585\.Cited by:[§A\.6](https://arxiv.org/html/2606.00532#A1.SS6.p1.1)\.
- C\. Snell, J\. Lee, K\. Xu, and A\. Kumar \(2024\)Scaling llm test\-time compute optimally can be more effective than scaling model parameters\.arXiv preprint arXiv:2408\.03314\.Cited by:[§A\.5](https://arxiv.org/html/2606.00532#A1.SS5.p1.1),[§1](https://arxiv.org/html/2606.00532#S1.p2.11),[§3\.3](https://arxiv.org/html/2606.00532#S3.SS3.p1.1)\.
- H\. Sun, Y\. Min, Z\. Chen, W\. X\. Zhao, L\. Fang, Z\. Liu, Z\. Wang, and J\. Wen \(2025\)Challenging the boundaries of reasoning: an olympiad\-level math benchmark for large language models\.arXiv preprint arXiv:2503\.21380\.Cited by:[§A\.9](https://arxiv.org/html/2606.00532#A1.SS9.p1.1),[§3\.1](https://arxiv.org/html/2606.00532#S3.SS1.p2.1)\.
- M\. Suzgun, M\. Yuksekgonul, F\. Bianchi, D\. Jurafsky, and J\. Zou \(2025\)Dynamic cheatsheet: test\-time learning with adaptive memory\.arXiv preprint arXiv:2504\.07952\.Cited by:[§A\.2](https://arxiv.org/html/2606.00532#A1.SS2.p1.1)\.
- V\. Y\.H\. Toh, D\. Ghosal, and S\. Poria \(2024\)Not all votes count\! programs as verifiers improve self\-consistency of language models for math reasoning\.arXiv preprint arXiv:2410\.12608\.Cited by:[§A\.5](https://arxiv.org/html/2606.00532#A1.SS5.p1.1)\.
- P\. Wang, L\. Li, Z\. Shao, R\. X\. Xu, D\. Dai, Y\. Li, D\. Chen, Y\. Wu, and Z\. Sui \(2024a\)Math\-shepherd: verify and reinforce llms step\-by\-step without human annotations\.InACL,Cited by:[§A\.5](https://arxiv.org/html/2606.00532#A1.SS5.p1.1),[§A\.6](https://arxiv.org/html/2606.00532#A1.SS6.p1.1)\.
- X\. Wang, C\. Li, Z\. Wang, F\. Bai, H\. Luo, J\. Zhang, N\. Jojic, E\. Xing, and Z\. Hu \(2024b\)PromptAgent: strategic planning with language models enables expert\-level prompt optimization\.InICLR,Cited by:[§A\.3](https://arxiv.org/html/2606.00532#A1.SS3.p1.1)\.
- X\. Wang, J\. Wei, D\. Schuurmans, Q\. V\. Le, E\. H\. Chi, S\. Narang, A\. Chowdhery, and D\. Zhou \(2022\)Self\-consistency improves chain of thought reasoning in language models\.arXiv preprint arXiv:2203\.11171\.Cited by:[§A\.5](https://arxiv.org/html/2606.00532#A1.SS5.p1.1),[§1](https://arxiv.org/html/2606.00532#S1.p2.11),[§3\.3](https://arxiv.org/html/2606.00532#S3.SS3.p1.1)\.
- Z\. Wang, J\. Araki, Z\. Jiang, M\. R\. Parvez, and G\. Neubig \(2023\)Learning to filter context for retrieval\-augmented generation\.arXiv preprint arXiv:2311\.08377\.Cited by:[§A\.4](https://arxiv.org/html/2606.00532#A1.SS4.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. H\. Chi, Q\. V\. Le, and D\. Zhou \(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.InNeurIPS,Cited by:[§A\.5](https://arxiv.org/html/2606.00532#A1.SS5.p1.1)\.
- D\. Wu, W\. U\. Ahmad, D\. Zhang, M\. K\. Ramanathan, and X\. Ma \(2024\)Repoformer: selective retrieval for repository\-level code completion\.arXiv preprint arXiv:2403\.10059\.Cited by:[§A\.4](https://arxiv.org/html/2606.00532#A1.SS4.p1.1)\.
- D\. Wu, J\. Gu, K\. Chang, and N\. Peng \(2025\)Self\-routing rag: binding selective retrieval with knowledge verbalization\.arXiv preprint arXiv:2504\.01018\.Cited by:[§A\.4](https://arxiv.org/html/2606.00532#A1.SS4.p1.1)\.
- W\. Xu, Z\. Liang, K\. Mei, H\. Gao, J\. Tan, and Y\. Zhang \(2025\)A\-mem: agentic memory for llm agents\.arXiv preprint arXiv:2502\.12110\.Cited by:[§A\.2](https://arxiv.org/html/2606.00532#A1.SS2.p1.1)\.
- S\. Yan, J\. Gu, Y\. Zhu, and Z\. Ling \(2024\)Corrective retrieval augmented generation\.arXiv preprint arXiv:2401\.15884\.Cited by:[§A\.4](https://arxiv.org/html/2606.00532#A1.SS4.p1.1)\.
- C\. Yang, X\. Wang, Y\. Lu, H\. Liu, Q\. V\. Le, D\. Zhou, and X\. Chen \(2024\)Large language models as optimizers\.InICLR,Cited by:[§A\.3](https://arxiv.org/html/2606.00532#A1.SS3.p1.1)\.
- S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. L\. Griffiths, Y\. Cao, and K\. Narasimhan \(2023\)Tree of thoughts: deliberate problem solving with large language models\.InNeurIPS,Cited by:[§A\.5](https://arxiv.org/html/2606.00532#A1.SS5.p1.1)\.
- Z\. Yao, W\. Qi, L\. Pan, S\. Cao, L\. Hu, W\. Liu, L\. Hou, and J\. Li \(2025\)SeaKR: self\-aware knowledge retrieval for adaptive retrieval\-augmented generation\.InACL,Cited by:[§A\.4](https://arxiv.org/html/2606.00532#A1.SS4.p1.1)\.
- H\. Ye, X\. He, V\. Arak, H\. Dong, and G\. Song \(2026\)Meta context engineering via agentic skill evolution\.arXiv preprint arXiv:2601\.21557\.Cited by:[§A\.1](https://arxiv.org/html/2606.00532#A1.SS1.p3.1)\.
- L\. Yu, W\. Jiang, H\. Shi, J\. Yu, Z\. Liu, Y\. Zhang, J\. T\. Kwok, Z\. Li, A\. Weller, and W\. Liu \(2023\)MetaMath: bootstrap your own mathematical questions for large language models\.arXiv preprint arXiv:2309\.12284\.Cited by:[§A\.6](https://arxiv.org/html/2606.00532#A1.SS6.p1.1)\.
- E\. Zelikman, Y\. Wu, J\. Mu, and N\. D\. Goodman \(2022\)STaR: bootstrapping reasoning with reasoning\.arXiv preprint arXiv:2203\.14465\.Cited by:[§A\.6](https://arxiv.org/html/2606.00532#A1.SS6.p1.1)\.
- J\. Zhang, X\. Liu, Y\. Hu, C\. Niu, F\. Wu, and G\. Chen \(2025a\)RAGRouter: learning to route queries to multiple retrieval\-augmented language models\.arXiv preprint arXiv:2505\.23052\.Cited by:[§A\.4](https://arxiv.org/html/2606.00532#A1.SS4.p1.1)\.
- J\. Zhang, J\. Xiang, Z\. Yu, F\. Teng, X\. Chen, J\. Chen, M\. Zhuge, X\. Cheng, S\. Hong, J\. Wang, B\. Zheng, B\. Liu, Y\. Luo, and C\. Wu \(2024\)AFlow: automating agentic workflow generation\.arXiv preprint arXiv:2410\.10762\.Cited by:[§A\.7](https://arxiv.org/html/2606.00532#A1.SS7.p1.1)\.
- Q\. Zhang, C\. Hu, S\. Upasani, B\. Ma, F\. Hong, V\. Kamanuru, J\. Rainton, C\. Wu, M\. Ji, H\. Li, U\. Thakker, J\. Zou, and K\. Olukotun \(2025b\)Agentic context engineering: evolving contexts for self\-improving language models\.arXiv preprint arXiv:2510\.04618\.Cited by:[§A\.1](https://arxiv.org/html/2606.00532#A1.SS1.p1.1),[§1](https://arxiv.org/html/2606.00532#S1.p1.1),[§2\.3](https://arxiv.org/html/2606.00532#S2.SS3.SSS0.Px1.p1.6),[§3\.2](https://arxiv.org/html/2606.00532#S3.SS2.p1.1),[§3\.5](https://arxiv.org/html/2606.00532#S3.SS5.p3.2)\.
- Y\. Zhang, J\. Shu, Y\. Ma, X\. Lin, S\. Wu, and J\. Sang \(2025c\)Memory as action: autonomous context curation for long\-horizon agentic tasks\.arXiv preprint arXiv:2510\.12635\.Cited by:[§A\.2](https://arxiv.org/html/2606.00532#A1.SS2.p1.1)\.

## Appendix ARelated Work

### A\.1ACE, GEPA, and MCE

ACE frames inference\-time improvement as context engineering through evolving playbooks\[Zhanget al\.,[2025b](https://arxiv.org/html/2606.00532#bib.bib1)\]\. Its strength is that it learns reusable context from experience, but this same accumulation can become a liability on hard math benchmarks if irrelevant guidance grows faster than useful guidance\.KACEaddresses this limitation by replacing one growing playbook with a structured tree of typed cards stratified by difficulty and domain\.

GEPA\-style systems are closely related in spirit because they also improve prompts or reasoning behavior through iterative feedback and guided refinement\. The main difference is thatKACEis not optimizing a single prompt artifact; it learns a structured tree of reusable knowledge cards and consumes that tree under an empirical difficulty signal at test\-time\. This makes the learning target richer than prompt improvement alone, and the storage capacity unbounded by the size of one active prompt\.

MCE takes a bi\-level meta\-learning view in which context\-engineering skills and context artifacts co\-evolve at evaluation time\[Yeet al\.,[2026](https://arxiv.org/html/2606.00532#bib.bib2)\]\.KACEoccupies a deliberately simpler design point\. We are*not*performing meta\-learning: there is no inner\-loop optimizer over skills, no outer\-loop optimizer over context\-engineering operators, and no co\-evolution at evaluation\. Instead, the epistemic tree is constructed once on a training split, scored by paired train\-verify lift, and frozen at evaluation; only the test\-time projection is dynamic per problem\. We also draw a sharper line between*skills*\(how to reason or intervene\) and*knowledge*\(what should be surfaced for a problem family\):KACEoptimizes the latter exclusively\. The empirical bet is that a well\-structured, frozen tree of typed cards can already capture the bulk of the gains attributed to richer meta\-learning machinery, at substantially lower system complexity\.

### A\.2Reflection, Memory, and Self\-Improvement

Reflexion, Self\-Refine, and related self\-improvement systems show that language feedback can serve as an optimization signal at inference time\[Shinnet al\.,[2023](https://arxiv.org/html/2606.00532#bib.bib4), Madaanet al\.,[2023](https://arxiv.org/html/2606.00532#bib.bib5)\]\. Dynamic Cheatsheet further shows that persistent memory can help difficult reasoning by reusing validated insights across tasks\[Suzgunet al\.,[2025](https://arxiv.org/html/2606.00532#bib.bib3)\]\. A more recent line argues that*what stays in working context*is itself a learnable decision: Memory\-as\-Action\[Zhanget al\.,[2025c](https://arxiv.org/html/2606.00532#bib.bib10)\]treats context curation as an explicit policy, A\-MEM\[Xuet al\.,[2025](https://arxiv.org/html/2606.00532#bib.bib11)\]and Mem0\[Chhikaraet al\.,[2025](https://arxiv.org/html/2606.00532#bib.bib12)\]build production\-oriented agentic memory stacks, and MemGPT\[Packeret al\.,[2023](https://arxiv.org/html/2606.00532#bib.bib13)\]formalizes context management as an OS\-style problem\.*Lost in the Middle*\[Liuet al\.,[2024](https://arxiv.org/html/2606.00532#bib.bib14)\]establishes empirically that long context is not free, which motivates keeping the active prompt small\.KACEbuilds on these ideas but makes two changes\. First, it converts raw feedback into typed knowledge cards keyed by failure mode and placed into a \(difficulty, domain\) node rather than concatenated into one stream\. Second, it consults the cards*conditionally*on the empirical difficulty signal at test\-time, by exiting card\-free at ES and reading only the matching branch after escalation, rather than applying memory uniformly\.

### A\.3Prompt Evolution and Modular LM Programs

A separate family of methods compresses experience into a compact natural\-language artifact through reflective or evolutionary search\. GEPA\[Agrawalet al\.,[2025](https://arxiv.org/html/2606.00532#bib.bib15)\]is our primary foil: its compact\-prompt compression is real and powerful, but bounded by the size of a single prompt\. DSPy and MIPRO\[Khattabet al\.,[2024](https://arxiv.org/html/2606.00532#bib.bib16), Opsahl\-Onget al\.,[2024](https://arxiv.org/html/2606.00532#bib.bib17)\]optimize instructions and demonstrations for multi\-stage LM programs; ProTeGi\[Pryzantet al\.,[2023](https://arxiv.org/html/2606.00532#bib.bib18)\]performs gradient\-style prompt search; OPRO\[Yanget al\.,[2024](https://arxiv.org/html/2606.00532#bib.bib19)\]treats LLMs as optimizers; Promptbreeder\[Fernandoet al\.,[2023](https://arxiv.org/html/2606.00532#bib.bib20)\]evolves prompts via self\-referential mutation; PromptAgent\[Wanget al\.,[2024b](https://arxiv.org/html/2606.00532#bib.bib21)\]introduces planning\-style prompt search; and SAMMO\[Schnabel and Neville,[2024](https://arxiv.org/html/2606.00532#bib.bib22)\]performs symbolic prompt\-program search\. Each of these methods defines a unit of optimization — an instruction, a demonstration, a program structure, a search tree\.KACEextends the list with a different unit: a typed knowledge card that lives at a node of an epistemic tree and does not need to live in any single active prompt\.

### A\.4Adaptive Retrieval and Learned Source Selection

A third line of work treats retrieval, retrieval source, or model choice as a learned action\. Self\-RAG\[Asaiet al\.,[2024](https://arxiv.org/html/2606.00532#bib.bib23)\]casts retrieval as a reflection token, deciding per step whether to retrieve\. Adaptive\-RAG\[Jeonget al\.,[2024](https://arxiv.org/html/2606.00532#bib.bib24)\]routes queries by question complexity to retrieval pipelines of different cost\. RouterRetriever\[Leeet al\.,[2024](https://arxiv.org/html/2606.00532#bib.bib25)\], Self\-Routing RAG\[Wuet al\.,[2025](https://arxiv.org/html/2606.00532#bib.bib26)\], RAGRouter\[Zhanget al\.,[2025a](https://arxiv.org/html/2606.00532#bib.bib27)\], Corrective RAG\[Yanet al\.,[2024](https://arxiv.org/html/2606.00532#bib.bib28)\], and SeaKR\[Yaoet al\.,[2025](https://arxiv.org/html/2606.00532#bib.bib29)\]each route among retrievers, sources, or repair operators\. RouteLLM\[Onget al\.,[2024](https://arxiv.org/html/2606.00532#bib.bib9)\]routes among models\. Two further design points are particularly aligned withKACE: Repoformer\[Wuet al\.,[2024](https://arxiv.org/html/2606.00532#bib.bib48)\]self\-supervises selective retrieval by*whether retrieval improves the downstream model*, which is exactly the labeling philosophy behind our paired train\-verify helpfulness scores; and FILCO\[Wanget al\.,[2023](https://arxiv.org/html/2606.00532#bib.bib49)\]learns to filter retrieved context by utility, with an explicit “show nothing” action, of which our strict non\-cumulative per\-tier visibility \(hard cards never reach MS\) is a static, schema\-level analog\.KACE’s nearest neighbor among these methods is Adaptive\-RAG, which routes by question complexity and is analogous to our tier gate\. The novelty is to make the*difficulty signal itself*the test\-time projection that selects which slice of a learned, structured knowledge tree is read — with the difficulty signal supplied by per\-tier agreement gates rather than by an external classifier\.

### A\.5Self\-Consistency, Search, Verification, and Compute\-Optimal Scaling

Chain\-of\-thought prompting and self\-consistency established the value of sampled reasoning traces for mathematical reasoning\[Weiet al\.,[2022](https://arxiv.org/html/2606.00532#bib.bib30), Wanget al\.,[2022](https://arxiv.org/html/2606.00532#bib.bib6)\]\. DIVERSE\[Liet al\.,[2023](https://arxiv.org/html/2606.00532#bib.bib31)\]and DIPPER\[Lauet al\.,[2024](https://arxiv.org/html/2606.00532#bib.bib32)\]make diversity and verification independent axes; Tree of Thoughts\[Yaoet al\.,[2023](https://arxiv.org/html/2606.00532#bib.bib33)\]and Graph of Thoughts\[Bestaet al\.,[2024](https://arxiv.org/html/2606.00532#bib.bib34)\]generalize sampling into search\. A separate verifier line—GSM8K verifiers\[Cobbeet al\.,[2021](https://arxiv.org/html/2606.00532#bib.bib35)\], step\-level supervision\[Lightmanet al\.,[2023](https://arxiv.org/html/2606.00532#bib.bib36)\], Math\-Shepherd\[Wanget al\.,[2024a](https://arxiv.org/html/2606.00532#bib.bib37)\], programs\-as\-verifiers\[Tohet al\.,[2024](https://arxiv.org/html/2606.00532#bib.bib38)\]—establishes that selection is often more bottlenecked than coverage\. Compute\-optimal scaling\[Snellet al\.,[2024](https://arxiv.org/html/2606.00532#bib.bib39)\]shows that the right test\-time policy depends on problem difficulty, motivating tier\-conditioned compute\. Adaptive\-Consistency\[Aggarwalet al\.,[2023](https://arxiv.org/html/2606.00532#bib.bib50)\]is the closest direct prior on*difficulty\-conditioned self\-consistency*: it dynamically chooses the per\-question sampling budget via an early\-exit rule, foreshadowing per\-tier exit gates\. MDAgents\[Kimet al\.,[2024](https://arxiv.org/html/2606.00532#bib.bib46)\]and CascadeDebate\[Changet al\.,[2026](https://arxiv.org/html/2606.00532#bib.bib47)\]take the same difficulty\-routed\-compute idea outside self\-consistency proper: MDAgents switches between solo and group LLM collaboration by task complexity, and CascadeDebate alternates single\-model inference with multi\-agent deliberation at escalation boundaries across an LLM cascade\.KACE’s tiered self\-consistency shape \(§[2\.2](https://arxiv.org/html/2606.00532#S2.SS2)\) sits in the intersection of these lines, with two distinguishing ingredients: \(i\) strict, non\-overlapping per\-tier exit gates with fixed attempt counts \(2/3/52/3/5\) rather than a continuous early\-exit threshold, and \(ii\) per\-tier*context*conditioning \(different node of the epistemic tree at each tier\), so the effective inference policy is a function of difficulty in both budget*and*prompt\.

### A\.6Teacher Distillation and Synthetic Math

STaR\[Zelikmanet al\.,[2022](https://arxiv.org/html/2606.00532#bib.bib40)\], MetaMath\[Yuet al\.,[2023](https://arxiv.org/html/2606.00532#bib.bib41)\], Math\-Shepherd\[Wanget al\.,[2024a](https://arxiv.org/html/2606.00532#bib.bib37)\], and ReST\-EM\[Singhet al\.,[2023](https://arxiv.org/html/2606.00532#bib.bib42)\]distill teacher rationales or synthetic data into model weights\.KACEleaves weights frozen and distills into a structured external tree instead; we cite this line only to disambiguate, not to compete on equal terms\.

### A\.7Agentic Search and Workflow Optimization

Agentic search methods such as AFlow and ADAS search over workflows, decompositions, or system architectures instead of over static prompts\[Zhanget al\.,[2024](https://arxiv.org/html/2606.00532#bib.bib7), Huet al\.,[2024](https://arxiv.org/html/2606.00532#bib.bib8)\]\. These methods show that the structure of inference itself can be optimized\.KACEshares this spirit but shifts the search target: the central object is not the workflow but a structured library of typed knowledge cards organized along difficulty and domain\.

### A\.8Conditional Computation and Mixtures of Experts

Routing methods such as RouteLLM dynamically allocate different models or compute levels depending on expected difficulty and cost\[Onget al\.,[2024](https://arxiv.org/html/2606.00532#bib.bib9)\]\. Mixture\-of\-experts architectures generalize the same principle inside the model: only a subset of experts should be active for a given input\.KACEextends this conditional\-computation perspective to external context\. Instead of conditioning only among models or subnetworks, it conditions among knowledge interventions: the empirical difficulty tier and the inferred domain together pick the branch of the epistemic tree that is read into the active prompt\. That is the key conceptual bridge between conditional computation and context engineering in our framework\.

### A\.9Math Benchmarks

We evaluate on AIME 2025 \(Mathematical Association of America\), the Hendrycks MATH dataset\[Hendryckset al\.,[2021](https://arxiv.org/html/2606.00532#bib.bib43)\], OlymMATH\[Sunet al\.,[2025](https://arxiv.org/html/2606.00532#bib.bib44)\], and cite OlympiadBench\[Heet al\.,[2024](https://arxiv.org/html/2606.00532#bib.bib45)\]and Omni\-MATH\[Gaoet al\.,[2024](https://arxiv.org/html/2606.00532#bib.bib51)\]for context on olympiad\-level evaluation\.

## Appendix BOffline Learner Pseudocode

Algorithm[1](https://arxiv.org/html/2606.00532#alg1)gives the full pseudocode of theKACEoffline learner referenced in §[2\.3](https://arxiv.org/html/2606.00532#S2.SS3): a per\-cell sweep over learned\-card \(difficulty, domain\) cells in Phase 1, followed by a per\-domain revise\-and\-verify pass with boundedRefinebudget in Phase 2\.

Algorithm 1KACE Offline Learner— block\-local ADD followed by induced\-distribution REFINE\.1:training split

TT; lift threshold

τlift\\tau\_\{\\text\{lift\}\}; epoch budget

NN; refine budget

M∈\{1,2\}M\\in\\\{1,2\\\}\.

2:frozen epistemic tree

𝒦\\mathcal\{K\}over \(difficulty, domain\) nodes\.

3:

𝒟←BuildEpistemicTree​\(T\)\\mathcal\{D\}\\leftarrow\\textsc\{BuildEpistemicTree\}\(T\)⊳\\trianglerightagent call: emits domain partition\{Td\}d∈𝒟\\\{T\_\{d\}\\\}\_\{d\\in\\mathcal\{D\}\}from a filesystem\-level scan ofTT

4:

𝒦←∅\\mathcal\{K\}\\leftarrow\\emptyset
5:for

n=1,…,Nn=1,\\dots,Ndo

6:// ===== Phase 1 — populate the tree \(add new cards\) =====

7:// 1a: per\-cell knowledge sweep over learned\-card \(difficulty, domain\) cells

8:foreach tier

t∈\{MS, HS\}t\\in\\\{\\text\{MS, HS\}\\\}andeach domain

d∈𝒟d\\in\\mathcal\{D\}do

9:

τt,d←TrainRun​\(Td,𝒦\)\|t\\tau\_\{t,d\}\\leftarrow\\textsc\{TrainRun\}\(T\_\{d\},\\mathcal\{K\}\)\|\_\{t\}⊳\\trianglerighttier\-ttslice of the tiered run onTdT\_\{d\}

10:

𝒞t,d←FindKnowledge​\(τt,d,t,d\)\\mathcal\{C\}\_\{t,d\}\\leftarrow\\textsc\{FindKnowledge\}\(\\tau\_\{t,d\},t,d\)⊳\\trianglerightagent: search cell\(t,d\)\(t,d\)for missing patterns; tagged\(t,d\)\(t,d\)

11:

𝒦←Consolidate​\(𝒦∪GateJudge​\(𝒞t,d\)\)\\mathcal\{K\}\\leftarrow\\textsc\{Consolidate\}\(\\mathcal\{K\}\\cup\\textsc\{GateJudge\}\(\\mathcal\{C\}\_\{t,d\}\)\)
12:endfor

13:// ===== Phase 2 — revise existing cards via Refine\+\+verify =====

14:foreach domain

d∈𝒟d\\in\\mathcal\{D\}do

15:

τd′←TrainRun​\(Td,𝒦\)\\tau\_\{d\}^\{\\prime\}\\leftarrow\\textsc\{TrainRun\}\(T\_\{d\},\\mathcal\{K\}\)⊳\\trianglerightindependent run with the post\-Phase\-1 tree

16:

hd←Impact​\(τd′,𝒦\|d\)h\_\{d\}\\leftarrow\\textsc\{Impact\}\(\\tau\_\{d\}^\{\\prime\},\\mathcal\{K\}\|\_\{d\}\)⊳\\trianglerightpaired leave\-one\-out lift per card at node\(∗,d\)\(\*,d\)

17:

ℛd←ReflectAndReviseCards​\(𝒦\|d,τd′,hd\)\\mathcal\{R\}\_\{d\}\\leftarrow\\textsc\{ReflectAndReviseCards\}\(\\mathcal\{K\}\|\_\{d\},\\tau\_\{d\}^\{\\prime\},h\_\{d\}\)⊳\\trianglerightedit existing cards: tighten/broaden, relocate, or drop nodes

18:

𝒦cand←ApplyRevisions​\(𝒦,ℛd\)\\mathcal\{K\}\_\{\\text\{cand\}\}\\leftarrow\\textsc\{ApplyRevisions\}\(\\mathcal\{K\},\\mathcal\{R\}\_\{d\}\)
19:for

m=1,…,Mm=1,\\dots,Mdo

20:

τd\(m\)←TrainRun​\(Td,𝒦cand\)\\tau\_\{d\}^\{\(m\)\}\\leftarrow\\textsc\{TrainRun\}\(T\_\{d\},\\mathcal\{K\}\_\{\\text\{cand\}\}\)⊳\\trianglerightverification rerun

21:if

Lift​\(τd\(m\),τd′\)≥τlift\\textsc\{Lift\}\(\\tau\_\{d\}^\{\(m\)\},\\tau\_\{d\}^\{\\prime\}\)\\geq\\tau\_\{\\text\{lift\}\}then

22:

𝒦←𝒦cand\\mathcal\{K\}\\leftarrow\\mathcal\{K\}\_\{\\text\{cand\}\};break⊳\\trianglerightcommit revision

23:else

24:

𝒦cand←Refine​\(𝒦cand,τd\(m\),ℛd\)\\mathcal\{K\}\_\{\\text\{cand\}\}\\leftarrow\\textsc\{Refine\}\(\\mathcal\{K\}\_\{\\text\{cand\}\},\\tau\_\{d\}^\{\(m\)\},\\mathcal\{R\}\_\{d\}\)⊳\\trianglerightagent self\-corrects the over\-large edit

25:endif

26:endfor

27:endfor

28:endfor

29:freeze

𝒦\\mathcal\{K\}
30:return

𝒦\\mathcal\{K\}

## Appendix CKnowledge\-Base Schema

This appendix gives the full card schema used in the epistemic tree𝒦\\mathcal\{K\}, as it appears in the implementation\. The schema is intentionally lean: each card stores a content body \(the on\-disk field name in our implementation logs ispayload\), a small list ofrouting\_conditionsretained from a prior design iteration, and tier and domain tags that determine the card’s node in the tree\.

#### Card fields\.

- •card\_id\(str, unique\)\.
- •payload\(str\): the natural\-language content of the card, shown to the solver when the active node of the tree is read; for medium cards this is typically one identity plus one sanity\-check, for hard cards a44–88\-step rubric\. The payload is acquired during the offline construction loop and may be edited during REFINE in the current or future epochs\.
- •routing\_conditions\(list\[str\]\): a small list of natural\-language clauses \(selection criteria and exclusion clauses\) attached to the card on disk\. At runtime, these clauses are surfaced in the card context so the model can choose better from the eligible tier\-domain pool; node position determines eligibility, and routing conditions help select and apply cards within that pool\.
- •difficulty\_tag∈\{medium,hard,universal\}\\in\\\{\\text\{medium\},\\text\{hard\},\\text\{universal\}\\\}: one of the two coordinates of the card’s node in the tree; drives strict per\-tier visibility \(see below\)\. ES is intentionally card\-free\.
- •domain\_tags\(list\[str\]\): the other coordinate of the card’s node in the tree; multi\-label, intersected with the active domain at evaluation\. The literal taguniversalpasses any domain\.
- •helpfulness\_score\(float\): per\-card score updated from paired train\-verify lift measurements during offline curation\.
- •provenance: - –source\(str\): e\.g\.,teacher\_distillationorwinning\_trace:<pid\>\. - –supporting\_problems\(list\[str\]\): training problems on which lift was measured\. - –validated\_lift\(str\): a free\-form lift annotation, e\.g\."\+2/6 \-\> \+4/6 on P026 \(k=6\)"\. - –promotion\_status∈\{experimental,validated,deprecated\}\\in\\\{\\text\{experimental\},\\text\{validated\},\\text\{deprecated\}\\\}\. - –n\_uses,n\_wins,n\_losses\(int\): usage and paired\-impact counters\. - –epoch\_introduced\(int\): curation\-loop epoch in which the card first entered𝒦\\mathcal\{K\}\.

#### Strict per\-tier visibility\.

The pipeline reads cards by the \(tier,difficulty\_tag\) pair using a strict, non\-cumulative table:

This non\-cumulative design is a deliberate departure from earlier versions, which made hard cards visible to medium solvers as well and allowed learned cards at ES\. Strict visibility keeps each node of the tree coherent in difficulty register and avoids leaking hard rubrics into MS, where they were observed to over\-anchor solver attempts\. ES remains card\-free because ES cards increased early exits but also increased wrong early exits\.

#### JSON form\.

Each card is stored in a thread\-safe JSON\-backed store\. A serialized card has the form:

```
{
  "card_id": "...",
  "payload": "...",
  "routing_conditions": ["...", "..."],
  "difficulty_tag": "medium" | "hard" | "universal",
  "domain_tags": ["..."],
  "helpfulness_score": 0.0,
  "provenance": {
    "source": "...",
    "supporting_problems": ["..."],
    "validated_lift": "...",
    "promotion_status": "experimental" | "validated" | "deprecated",
    "n_uses": 0,
    "n_wins": 0,
    "n_losses": 0,
    "epoch_introduced": -1
  }
}
```

#### Legacy fields\.

For backward compatibility with earlier serialized stores, three legacy fields \(scope,tier\_eligibility,tag\) may still appear on disk; the current pipeline ignores them and migrates legacyscopevalues intodomain\_tagson load\.

## Appendix DExample Learned Cards

Cards are distilled exclusively from training\-split traces \(§[3\.1](https://arxiv.org/html/2606.00532#S3.SS1)\); test\-split problems were never inspected during card construction\. To make the schema concrete, we show two cards distilled byKACEduring training on AIME\-style training problems and frozen for evaluation\. The first targets a hard\-geometry problem family; the second targets a hard\-combinatorics family\. Both cards were promoted tovalidatedstatus after positive paired train\-verify lift, and both live at the \(hard, geometry\) and \(hard, combinatorics\) nodes of the tree respectively\. The fields shown below are the literal contents of the card record \(the on\-disk field names are preserved verbatim\), flattened into the display read into the active prompt when the matching node is consulted\.

### Card 1:ALGO\_RUBRIC\_GEO\_3D\_TILTED\_SOLID\_FILL

```
card_id:        ALGO_RUBRIC_GEO_3D_TILTED_SOLID_FILL
shape:          algorithmic rubric
difficulty_tag: hard
domain_tags:    [geometry]
tier_visible:   HS
source:         meta_learner_4_24_epoch_1_rework
provenance:     supporting train problem P012; promotion_status = validated

routing_conditions (selection criteria):
  - vertex on plane
  - cube balanced
  - water level
  - horizontal plane
  - volume of water
  - tilted cube
  - body diagonal
routing_conditions (exclusion clauses):
  - solid is axis-aligned (not tilted)
  - problem asks about surface area rather than volume

payload (6-step rubric):
  1. Identify symmetry axis (body diagonal of the tilted solid).
  2. Compute vertex heights as dot products onto the vertical axis.
  3. Solve edge length first, before solving the fill-height question.
  4. Classify the water cross-section by which vertices fall below
     the fill height.
  5. Integrate, or decompose the submerged region into sub-pyramids
     and sum their volumes.
  6. Reduce the answer to p/q in lowest terms and output p + q.
```

This card was distilled for the “tilted cube with water” problem family that recurs on hard\-geometry AIME items\. It lives at the \(hard, geometry\) node of the tree, so it is read into the active prompt whenever the test\-time projection lands a problem at that node — that is, whenever the tiered pipeline escalates to HS and the coarse domain classifier \(or its post\-MS refinement\) tags the problem as geometry\.

### Card 2:MICRO\_GRID\_MAXIMAL\_ROW\_COL\_COLOR

```
card_id:        MICRO_GRID_MAXIMAL_ROW_COL_COLOR
shape:          micro rubric
difficulty_tag: hard
domain_tags:    [combinatorics]
tier_visible:   HS
source:         train_add_4_25_v2_epoch_1
provenance:     supporting train problem P042; promotion_status = validated

routing_conditions (selection criteria):
  - n-by-n grid with chips
  - all chips in same row are same color
  - all chips in same column are same color
  - maximal placement, no additional chip can be placed
  - two-color row-column constraint
routing_conditions (exclusion clauses):
  - chip-counting problem without a maximality clause
  - single-color grids (no row/column color partition)

payload (4-step rubric):
  1. Classify rows and columns into states {W, B, E}
     (white-only, black-only, empty).
  2. Enforce color-agreement at every filled cell:
     W-row x W-col and B-row x B-col only.
  3. Enforce maximality: every empty cell must be FORCED by a row/col
     color clash; every E-row must intersect both W- and B-columns.
  4. Count admissible (#W-rows, #B-rows, #W-cols, #B-cols) tuples
     and multiply the corresponding binomials.
```

This card was distilled for the “n×nn\\times ngrid, two\-color chips, maximal placement” problem family — a recurring family in olympiad combinatorics\. The card was distilled exclusively from training\-split traces and was not constructed with reference to any test\-set problem\. It lives at the \(hard, combinatorics\) node of the tree\. The four\-step rubric encodes the maximality reduction that the empty\-KB hard solver consistently missed on training problems of this shape\.

#### Reading these cards\.

A card is read into the active prompt exactly when the test\-time projection lands on its \(difficulty, domain\) node: the difficulty tier is supplied by the tiered self\-consistency pipeline \(a card atdifficulty\_tag = hardis reached only on escalation to HS\), and the domain is supplied by the coarse classifier intersected with the card’sdomain\_tags\. The card’s content is its applicability claim; its position in the tree is what determines whether that claim is consulted\.

## Appendix EAgent Prompts

This appendix reproduces verbatim the prompts used by theKACE All\-at\-once Learner\(Algorithm[1](https://arxiv.org/html/2606.00532#alg1)\) and by the test\-time pipeline\. Each subsection is titled by the smallcaps function name as it appears in Algorithm[1](https://arxiv.org/html/2606.00532#alg1); trivial template tokens \(problem text, candidate card lists, fire statistics\) are denoted<placeholder\>\. The prompts below are drawn from the open\-source release accompanying the paper\.

### E\.1BuildEpistemicTree

#### Source\.

meta\_learning/prompts/domain\_tree\.md\.

```
SYSTEM:
You are the DOMAIN PARTITIONER for a competition-math reasoning system named
KACE. KACE has a knowledge bank of intervention cards organized by domain
(e.g. "geometry", "algebra"). At inference time the system classifies each
problem into a single domain and the solver only sees cards from that domain
(plus universal cards). Therefore the domain partition controls how much
irrelevant context the solver has to wade through -- too few coarse domains
= bloated card sets; too many narrow domains = solver never sees the right
card.

Your job: read a corpus of math problems and propose a disjoint partition
into 3-7 domains that minimizes context bloat per problem.

Partition rules:
1. Number of domains: between 3 and 7.
2. Each domain holds >=10% of the problems.
3. Disjoint by classification: a typical reader can assign each problem to
   exactly one domain based only on the problem statement.
4. Cluster by technique family, not surface topic.
5. Include a universal slot for cross-domain sanity checks (not counted
   toward the 3-7).

Output strict JSON:
{
  "rationale": "1-3 sentences",
  "domains": [
    {
      "name": "snake_case_short_name",
      "description": "1-line description",
      "size_pct": 25,
      "membership_signals": ["short phrase 1", "short phrase 2", ...]
    }
    /* ... 3-7 entries ... */
  ]
}

USER (template):
PROBLEM CORPUS (N problems sampled from training set):

[1] <problem 1 text>
EXPECTED: <answer 1>
[2] <problem 2 text>
EXPECTED: <answer 2>
... up to ~30 problems ...

Propose the partition as JSON (no prose outside JSON).
```

### E\.2FindKnowledge

The Phase\-1 per\-cell sweep \(FindKnowledge\) performs both pattern search and card distillation in a single prompt; the orchestrator dispatches one call per learned\-card cell\.

#### Source\.

meta\_learning/prompts/propose\_per\_domain\.md\.

```
SYSTEM:
You are the CARD PROPOSER for KACE. KACE has a 3-tier solver
(Easy / Medium / Hard) and a knowledge bank of intervention cards.
Each card is a short, reusable rule, formula, or rubric that the
solver reads at inference time to avoid a known failure mode.

Your job: read the wrong-attempt traces of problems in a single
domain and propose new cards that, if shown to the solver next time,
would prevent the first concrete mistake in those traces.

Card schema (strict):
{
  "card_id": "PREFIX_DOMAIN_TECHNIQUE_KEYWORD",
  "payload": "Useful when:\n- pattern1 (<=5 words)\n- pattern2\n- pattern3\nDo not apply if:\n- exclusion1 (<=5 words)\n\n<body>",
  "difficulty_tag": "medium" | "hard" | "universal",
  "domain_tags": ["<domain>"]
}

Difficulty tag rule:
- medium: short formula / identity / single-pass procedure (<=6 lines)
- hard: multi-step rubric (<=12 lines)
- universal: applies across all domains (use sparingly)

Card-ID style (ALL_CAPS_SNAKE_CASE, 3-6 tokens):
- EXACT_*       -- formula or identity
- RUBRIC_* / ALGO_RUBRIC_*  -- multi-step procedure
- READING_*     -- problem-misreading guard
- CASE_SPLIT_*  -- case-completeness reminder
- ANTI_*        -- meta-rule

Hard rules (self-reject any card that violates):
1. Answer leakage: never quote a numeric answer from any problem.
2. Instance overfit: replace problem-specific names with abstract analogs.
3. Benchmark neutrality: no mentions of "AIME", "USAMO", "[0,999]",
   "mod 1000", or any benchmark-specific format.
4. No vague exhortations.
5. Body length: medium <=6 non-blank lines; hard <=12; universal <=4.
6. Domain tag must match the input bucket’s domain unless universal.

Volume guidance:
For each problem propose 1-3 cards. Default to 2.

Output strict JSON:
{ "domain": "<input domain>",
  "n_problems_addressed": <int>,
  "cards": [ /* array of cards */ ] }
Return ONLY this JSON.

USER (template):
DOMAIN: <domain>
EXISTING KB CARDS (read for format reference; do NOT duplicate):
[abbreviated list of card_ids and their first-line trigger]

PROBLEMS WITH WRONG ATTEMPTS (N entries):
[1] problem_id=P0XX
    PROBLEM: <text>
    EXPECTED: <answer>
    WRONG ATTEMPTS:
      - tier=MS, idx=0, wrong_ans=...
        excerpt: <last 1500 chars of trace>
      - ...
    EXIT: <es_unanimous|ms_majority|hs_plurality|...>
    CARDS SHOWN MS: [...]
    CARDS SHOWN HS: [...]
[2] ...

Propose new cards as JSON.
```

### E\.3ReflectAndReviseCards,Refine, andGateJudge

The Phase\-2 per\-card revision \(ReflectAndReviseCards\), theRefinebranch that self\-corrects over\-large edits, and the leak/quality/tier\-coherence checks ofGateJudgeare folded into a single curator prompt;GateJudge’s constraints appear inside the strict\-card\-schema block reused from the proposer above \(answer\-leakage, instance\-overfit, benchmark\-neutrality\), andRefineis operationalized as the EDIT branch of the curator’s KEEP/EDIT/DEPRECATE decision\.

#### Source\.

meta\_learning/prompts/curate\_per\_domain\.md\.

```
SYSTEM:
You are the CURATOR for KACE’s intervention cards in a single domain.
You will see (a) every card in this domain, (b) per-card fire statistics
from a recent train run, and (c) the training problems in this domain
that were still wrong after that run. Validation outcomes are NEVER
shown to you and never feed into the curation decision. Your job is to
clean up the bucket so the next run benefits from a tighter, more
targeted KB.

Decision options for each card:
- KEEP: card fired >=2 times AND shown_correct >= shown_wrong / 2.
- EDIT: card content sound but trigger header too vague or too narrow.
        Rewrite the trigger header (and optionally the body); each
        trigger bullet <=5 words; stay within tier line caps.
- DEPRECATE: drop the card entirely. Use when:
  - the card never fired and its content is too narrow to ever apply
    outside the original problem,
  - it is redundant with another card in the same domain
    (note which one in ‘reason‘),
  - it consistently fires on wrong problems with
    shown_wrong > shown_correct + 1,
  - the body contains a wrong/misleading rule.

Decision priors (use to break ties):
- For cards with n_shown == 0: prefer EDIT if technique-rich;
  prefer DEPRECATE if problem-specific recipe.
- For cards with shown_correct >> shown_wrong: KEEP.
- For cards introduced in this epoch with n_shown == 0: prefer DEPRECATE.

No new cards in Phase 2:
You MUST NOT propose any new cards in this phase. Phase 2 is strictly
edit-only over the existing card set; new-card distillation happens in
Phase 1. If the wrong-problems list reveals a failure mode that no
existing card can be edited to cover, mark the affected cards
DEPRECATE and surface the gap in the ‘reason‘ field; the next Phase 1
will pick it up.

Output strict JSON:
{
  "domain": "<input domain>",
  "decisions": [
    {"card_id": "...", "action": "KEEP"},
    {"card_id": "...", "action": "EDIT",
     "new_payload": "...", "reason": "..."},
    {"card_id": "...", "action": "DEPRECATE", "reason": "..."}
  ]
}
Return ONLY this JSON.

USER (template):
DOMAIN: <domain>

EXISTING CARDS (N entries with fire stats):
[card_id]
  difficulty: <tag>, domains: [...]
  payload: <full payload>
  fire stats: n_shown=N, shown_correct=C, shown_wrong=W,
              tier_breakdown={MS:.., HS:..}
  shown_on_problems: [...]

WRONG PROBLEMS IN THIS DOMAIN (M entries):
[1] problem_id=...
    PROBLEM: <text>
    EXPECTED: <answer>
    FINAL: <model answer>
    EXIT: <path>
    CARDS_SHOWN_MS: [...]
    CARDS_SHOWN_HS: [...]
[2] ...

Decide each card and return JSON.
```

### E\.4Coarse Domain Classifier \(test\-time\)

#### Source\.

kace/agents/domain\_classifier\.py,\_build\_system\_prompt\(\)\.

```
You are a math problem domain classifier.

Read the problem and pick the SINGLE best primary domain from:
{lines}
- "mixed"  -- only if the problem genuinely combines two domains as primary

Use the EXACT domain name (snake_case where shown). Downstream card filters
match on these exact names; deviations exclude relevant cards.

Respond with strict JSON: {"primary": <one of the listed domain names>}.
```

The list\{lines\}is dynamically expanded from the domain registry produced byBuildEpistemicTree\.

### E\.5Easy Solver \(ES\)

#### Source\.

kace/agents/easy\_solver\.py\.

```
SYSTEM:
You are a careful competition-math solver.

LENS FOR THIS ATTEMPT: {lens_prompt}

End your response with the final answer enclosed in \boxed{...}.

USER:
PROBLEM:
{problem_text}

Now solve the problem carefully.

Lenses (DIPPER 2-lens rotation across the 2 ES attempts):
- domain_patterns: Look for the underlying mathematical domain (number
    theory, combinatorics, geometry, algebra) and recall canonical
    techniques for problems of that shape. Name the technique before
    using it.
- computation_chain: Focus on the algebraic/arithmetic chain. Derive
    symbolically, avoid numerical approximation, and verify each step
    before continuing.
```

### E\.6Medium Solver \(MS\)

#### Source\.

kace/agents/medium\_solver\.py\.

```
SYSTEM:
You are a careful competition-math solver. You receive a small set of
MEDIUM-tier guidance cards selected for this problem.

Use the cards where their premise fits. Quote any formula or rule
explicitly when you use it.

LENS FOR THIS ATTEMPT: {lens_prompt}

End your response with the final answer enclosed in \boxed{...}.

{card_block}

USER:
PROBLEM:
{problem_text}

Now solve the problem carefully, applying any relevant card.

Lenses (DIPPER 3-lens rotation across the 3 MS attempts):
- domain_patterns:    Look for the underlying mathematical domain and
                      recall canonical techniques.
- computation_chain:  Focus on the algebraic chain. Derive symbolically
                      and verify each step.
- problem_setup:      Read the problem precisely. State every condition
                      and constraint cleanly.
```

### E\.7Hard Solver \(HS\)

#### Source\.

kace/agents/hard\_solver\.py\.

```
SYSTEM:
You are a careful competition-math solver attacking a HARD problem.
Prior tiers (Easy, Medium) did not converge.

Apply the HARD-tier guidance cards where their premise fits. Quote any
rule or rubric step you follow. If a card is an algorithmic rubric,
walk through its steps.

End your response with the final answer enclosed in \boxed{...}.

{card_block}

USER:
PROBLEM:
{problem_text}

Now solve the problem carefully, applying the hard cards above.
```

## Appendix FImplementation Details

This appendix records the exact per\-tier configuration used for the main results \(Table[7](https://arxiv.org/html/2606.00532#A6.T7)\), the measured per\-problem solver\-call counts that justify the compute claims of §[3\.3](https://arxiv.org/html/2606.00532#S3.SS3)\(Table[8](https://arxiv.org/html/2606.00532#A6.T8)\), and the size and accuracy of the best frozen knowledge\-base snapshot per dataset \(Table[9](https://arxiv.org/html/2606.00532#A6.T9)\)\.

### F\.1Per\-Tier Configuration

Table 7:Per\-tier configuration used for all main\-table runs\. Output\-token caps differ by dataset: the MATH\-HARD run reuses a22k cap from an earlier sweep, while OlymMATH\-EN and AIME25 use the default66k/1212k/1212k caps\. The hard\-solver temperature isT=0\.8T\{=\}0\.8across all three datasets to encourage sampling diversity at the tier where coverage rather than agreement is the binding constraint; ES and MS solver temperatures areT=0\.6T\{=\}0\.6\. The coarse domain classifier is deterministic\.Table[7](https://arxiv.org/html/2606.00532#A6.T7)gives the exact knobs of the locked pipeline\. The split between solver temperature \(T=0\.6T\{=\}0\.6or0\.80\.8\) and the deterministic decoding \(T=0T\{=\}0\) of the coarse domain classifier reflects different roles: solvers benefit from sampling diversity inside a self\-consistency vote, whereas the domain projection that selects the active node of the tree must reason deterministically so that the cards reaching the solver are reproducible across reruns\.

### F\.2Solver Calls per Problem

Table 8:Average solver calls per problem \(ES \+ MS \+ HS only; routers and domain classifiers excluded\)\. Tiered baseline budgets: ES = 2, MS = 3, HS = 5 \(floor 2\.0, ceiling 10\.0\)\. 3\-seed averages where available\.Table[8](https://arxiv.org/html/2606.00532#A6.T8)grounds the comparable\-compute claim made in §[3\.3](https://arxiv.org/html/2606.00532#S3.SS3)\. MATH\-HARD is dominated by ES exits and averages well below the Best\-of\-55budget; OlymMATH\-EN and AIME 2025 escalate more frequently to MS and HS and average slightly above55calls\. In all cases the budget remains close to fixed Best\-of\-55because no problem can exceed the2\+3\+5=102\{\+\}3\{\+\}5=10budget and many exit early at ES or MS\.

### F\.3Knowledge\-Base Size per Dataset

Table 9:Best frozen epistemic tree per dataset\. “Total KB tokens” is the maximum context length if every card in𝒦\\mathcal\{K\}were shown to the solver simultaneously; in practice the test\-time projection lands on a single \(difficulty, domain\) node and only the cards under that node reach the solver, on the order of10001000–50005000tokens per problem on average\. The tree itself is therefore not constrained by the solver’s active context window\.Table[9](https://arxiv.org/html/2606.00532#A6.T9)reports the size of the frozen epistemic tree used at evaluation\. The numbers make explicit a property that is central to the storage\-vs\-injection separation argument of §[2\.1](https://arxiv.org/html/2606.00532#S2.SS1): although the durable tree carries thousands of tokens of distilled knowledge, only the small slice under the active \(difficulty, domain\) node \(∼\\sim1000–5000 tokens\) reaches the solver on escalated MS/HS problems\. This is the operational form of decoupling capacity from active context size, and it is what allowsKACEto grow the tree without paying a proportional active\-context tax\. Test accuracy entries report the mean headline numbers in Table[5](https://arxiv.org/html/2606.00532#S3.T5); all three benchmarks are reported as33\-seed averages\.

### F\.4Compute Accounting

All reported experiments use hosted API models; no local GPU training or weight updates are performed\. Inference\-time compute is reported in two ways: Table[7](https://arxiv.org/html/2606.00532#A6.T7)gives the per\-tier output\-token caps and Table[8](https://arxiv.org/html/2606.00532#A6.T8)reports measured average solver calls per problem on the test split\. The releasedKACEartifact reproduces the AIME 2025 result from the frozen 132\-card KB usingscripts/reproduce\_paper\.py; its default runner uses API parallelism rather than local accelerator compute\.

The offline construction loop consumes Claude Sonnet 4\.6 calls through the Anthropic Agent SDK for domain construction, card proposal, and curation, and consumes OpenAI API calls when the train/validation runner evaluates candidate trees\. Table[10](https://arxiv.org/html/2606.00532#A6.T10)reports the Sonnet\-side offline learner token budget and dollar cost for the two\-epoch construction setting\. Table[11](https://arxiv.org/html/2606.00532#A6.T11)reports the solver\-side gpt\-4\.1\-mini cost for the corresponding AIME run\. The compute comparison supporting the main empirical claims should still be read as an inference\-time solver\-call comparison; the tables below account for the offline learner and solver costs that build and evaluate the released card bank\.

Table 10:Offline learner cost on Sonnet 4\.6 with Anthropic prompt caching: per\-epoch input/output token split and 2\-epoch totals for the 132\-card AIME output\. Cache write=$​3\.75=\\mathdollar 3\.75/MTok \(1\.25×1\.25\\timesbase\), cache read=$​0\.30=\\mathdollar 0\.30/MTok \(0\.1×0\.1\\times\), base input=$​3=\\mathdollar 3/MTok, output=$​15=\\mathdollar 15/MTok\.
Table 11:Solver\-side cost on gpt\-4\.1\-mini for a 2\-epochKACErun on AIME \(train 45, validation 45, test 30; solver calls only, with routers and domain classifiers excluded; pricing$​0\.40\\mathdollar 0\.40/$​0\.10\\mathdollar 0\.10/$​1\.60\\mathdollar 1\.60per MTok input/cached/output\)\.Solver\-side cost is smaller than the cached offline learner cost on Sonnet 4\.6 \(Table[10](https://arxiv.org/html/2606.00532#A6.T10)\)\. The dominant solver\-side driver is output tokens at the MS and HS tiers\.

## Appendix GTier\-Exit Concordance with Empirical Difficulty

Headline numbers appear in Table[4](https://arxiv.org/html/2606.00532#S3.T4)\(§[3\.4](https://arxiv.org/html/2606.00532#S3.SS4)\); this appendix gives the per\-bucket breakdown and discusses the dominant failure mode\. This appendix evaluates how well the locked ES/MS/HS tier\-exit ordering preserves the model’s own empirical difficulty ordering of problems\. The evaluation is purely model\-derived: difficulty is defined by per\-problem solve rate under independent sampling, with no human\-supplied difficulty labels \(e\.g\., AMC/AIME problem position, MATH curriculum level\) entering the comparison\.

#### Empirical difficulty\.

For every test problemiiwe run the locked tiered pipeline acrossS=3S=3independent random seeds; for MATH\-HARD this uses the KACE iter\-4 KB reported below\. Each tier of the locked pipeline produces independent attempts \(ES=2=2, MS=3=3, HS=5=5\); problems that exit early have fewer total attempts than those that escalate\. LetTiT\_\{i\}be the total number of attempts pooled across theSSseeds \(ranging from66to3030\) andCiC\_\{i\}the number that produced the correct answer\. The empirical difficulty of problemiiis

raw​\_​ratei=Ci/Ti∈\[0,1\],\\mathrm\{raw\\\_rate\}\_\{i\}\\;=\\;C\_\{i\}/T\_\{i\}\\;\\in\\;\[0,1\],the fraction of independent samples that answer the problem correctly\.

#### Predictor\.

For each \(problem, seed\) we map the gate’s exit path to an ordinal exit rank: higher rank means the gate*predicts*an easier problem\.

Table 12:Ordinal exit\-rank predictor used in the concordance metric below\.For each problem the exit\-rank predictor iseie\_\{i\}, the mean rank across theSSseeds\.

#### Concordance metric\.

We report a pairwise concordance score \(equivalent to AUROC on the binary “easier\-than” lift\)\. Letri=raw​\_​rateir\_\{i\}=\\mathrm\{raw\\\_rate\}\_\{i\}andeie\_\{i\}the predicted lift, and writeΔ​ri​j=ri−rj\\Delta r\_\{ij\}=r\_\{i\}\-r\_\{j\},Δ​ei​j=ei−ej\\Delta e\_\{ij\}=e\_\{i\}\-e\_\{j\}\. Over ordered pairsi<ji<jwithΔ​ri​j≠0\\Delta r\_\{ij\}\\neq 0:

ρpair=∑i<jΔ​ri​j≠0\[𝟙​\{Δ​ri​j​Δ​ei​j\>0\}\+12​1​\{Δ​ei​j=0\}\]\#​\{i<j:Δ​ri​j≠0\}\.\\rho\_\{\\text\{pair\}\}\\;=\\;\\frac\{\\displaystyle\\sum\_\{\\begin\{subarray\}\{c\}i<j\\\\ \\Delta r\_\{ij\}\\neq 0\\end\{subarray\}\}\\Big\[\\mathbb\{1\}\\\{\\Delta r\_\{ij\}\\,\\Delta e\_\{ij\}\>0\\\}\\;\+\\;\\tfrac\{1\}\{2\}\\,\\mathbb\{1\}\\\{\\Delta e\_\{ij\}=0\\\}\\Big\]\}\{\\\#\\\{i<j:\\Delta r\_\{ij\}\\neq 0\\\}\}\.A value of1\.01\.0indicates perfect ordinal preservation,0\.50\.5is the random baseline, and0\.00\.0is fully inverted\. We also report Spearman’sρ\\rhoon the pairs\(ri,ei\)\(r\_\{i\},e\_\{i\}\)for completeness\.

### Results — AIME 2025 \(n=30n\{=\}30,S=3S\{=\}3,gpt\-4\.1\-mini\)

Table 13:AIME 2025 tier\-exit cells per difficulty bucket\. “cells” is the per\-bucket total of \(problem, seed\) pairs \(\#​probs×S\\\#\\text\{probs\}\\times S\)\.Spearman’sρ=0\.733\\rho=0\.733\. Pairwise concordanceρpair=302/386=78%\\rho\_\{\\text\{pair\}\}=302/386=78\\%\.
### Results — MATH\-HARD \(n=50n\{=\}50,S=3S\{=\}3,gpt\-4o\-mini, KACE iter\-4 KB\)

Table 14:MATH\-HARD tier\-exit cells per difficulty bucket\.Spearman’sρ=0\.816\\rho=0\.816\. Pairwise concordanceρpair=743/923=80%\\rho\_\{\\text\{pair\}\}=743/923=80\\%\.#### Reading\.

KACE’s tier\-exit hierarchy preserves the empirical difficulty ordering of78%78\\%of AIME pairs and80%80\\%of MATH\-HARD pairs — substantially above the50%50\\%random baseline\. The dominant failure mode on both benchmarks is the VERY\-HARD bucket \(raw​\_​rate=0%\\mathrm\{raw\\\_rate\}=0\\%\): problems that the model never solves but on which the gate*convergently exits early*, because all attempts agree on a wrong answer\. On AIME 2025,5\+10=155\+10=15of2424VERY\-HARD cells exit at ES or MS; on MATH\-HARD,7\+12=197\+12=19of2121VERY\-HARD cells do the same\. These are the cases where the gate has no signal to detect difficulty — agreement on a wrong answer is indistinguishable from agreement on a right answer at the level of self\-consistency — and they account for nearly all the discordant pairs\. The same pathology surfaces in fullKACEas wrongly selected cards on overconfident wrong answers, which is why calibration remains diagnostic rather than a closed loop\. Without a step\-level verifier or a learned judge, agreement is the strongest signal available, and agreement underestimates hardness in exactly the regime where harder reasoning is most needed\.

## Appendix HSynthesis Experiments

We investigated whether*context folding*across hard\-solver attempts could be used to lift the hard tier without spending extra solver calls\. The procedure was: after the HS pool of five attempts disagreed below the2/52/5exit gate, the teacher LLM read the five partial trajectories, extracted a*stable common prefix*of intermediate facts that the attempts agreed on \(the “stable point”\), summarized that prefix into a compact context, and asked the solver to finish the solution from that synthesized state\. The intent was to recover a66th attempt whose initialization conditioned on the agreed structure of the previous five rather than starting from the problem alone\.

The empirical result was that synthesis produced essentially no change in accuracy, and the small movement that did appear was dominated by where the stable\-prefix cut was placed\. Cutting too early forfeited useful intermediate work that the attempts had already agreed on; cutting too late propagated whichever line of attack had drifted off\-course in the majority of attempts, and the synthesized completion inherited the same error\. Crucially, there was no reliable way to ascertain how much of each HS attempt’s intermediate reasoning was actually correct: agreement across attempts is necessary but not sufficient for correctness, and without a step\-level verifier the teacher could only summarize what*looked*stable, not what was*known*stable\. This is the same bottleneck that limits self\-refinement methods more generally: a refinement step is only as good as the refiner’s ability to localize the actual point of divergence, and on hard math reasoning that localization is itself the unsolved subproblem\. We therefore disable synthesis in the main pipeline and report it here only to record the failure mode\.

## Appendix IImagination Experiments

We also experimented with expanding the knowledge base through*imagination*— generating cards that were not directly grounded in observed training failures, but were proposed by the teacher LLM as plausibly missing knowledge\. Concretely, for each existing cardcc, the teacher was prompted along the following lines:*“Suppose a base LLM is already using the errors and sanity\-checks captured by this card but still gets a problem in this domain wrong\. What else might it not know that an expert would consider obvious?”*The teacher proposed neighbor cards under this prompt, which were then put through the standard gate\-judge admission and consolidation steps\. Iterating this procedure expanded the KB from the curated 132\-card AIME tree used in the main results to 450 cards\.

Table 15:Imagination expansion of the KB\. Imagined cards unlocked new solves on the validation split, but additional refinement rounds drove validation accuracy up while test accuracy plateaued and then drifted down — the canonical signature of overfitting to a small validation set\.The expanded KB unlocked new solves on the validation split: problems that the curated KB could not solve found a useful imagined card and were solved\. However, two structural issues surfaced \(Table[15](https://arxiv.org/html/2606.00532#A9.T15)\)\. First, the*quality*of imagined cards became very hard to check: gate\-judging filters for ground\-truth leakage and surface quality, but neither check distinguishes a genuinely missing piece of expert knowledge from a plausible\-sounding restatement of something the model could already produce\. Without a downstream verifier strong enough to test whether an imagined card encodes content the solver*actually*lacks, lift on the validation split is the only available signal\. Second, this validation\-only signal was quickly exhausted: after a small number of refinement rounds the imagination loop had effectively memorized the validation residual, and further rounds raised validation accuracy without raising test accuracy — the canonical signature of overfitting to a small evaluation set\.

We therefore did not include imagined cards in the mainKACEresults\. The episode is informative on its own: it suggests that scaling the KB beyond what training\-trace failures can ground requires a verifier that operates above the level of paired train\-verify lift, and that without such a verifier additional KB capacity is not free — it converts straightforwardly into validation overfitting\. We treat this as a precursor to a judge\-in\-the\-loop extension\.

Similar Articles

Retrieval-Augmented Tutoring for Algorithm Tracing and Problem-Solving in AI Education

arXiv cs.AI

This paper presents KITE, a Retrieval-Augmented Generation (RAG)-based intelligent tutoring system for algorithmic reasoning and problem-solving in AI education. The system uses intent-aware Socratic response strategies and multimodal RAG to provide course-grounded, pedagogically appropriate feedback, and is evaluated through metrics, expert review, and simulated student interactions.

When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning

arXiv cs.CL

This paper introduces Adaptive Tool Trust Calibration (ATTC), a framework that improves tool-integrated reasoning models by enabling them to adaptively decide when to trust or ignore tool results based on code confidence scores. The approach addresses the "Tool Ignored" problem where models incorrectly dismiss correct tool outputs, achieving 4.1-7.5% performance improvements across multiple models and datasets.

R-APS: Compositional Reasoning and In-Context Meta-Learning for Constrained Design via Reflective Adversarial Pareto Search

arXiv cs.AI

R-APS (Reflective Adversarial Pareto Search) is a novel method for constrained design tasks that addresses three structural failures in LLM-based agentic systems—error propagation, robustness evaluation, and knowledge invalidation—through reasoning-mode decomposition across three timescales, requiring no fine-tuning. Evaluated on planar mechanism synthesis, it achieves 3.5x tighter robustness certificates, 46% faster iterations-to-first-admission, and 2.1x Chamfer-distance reduction over baselines.