Inducing Reasoning Primitives from Agent Traces
Summary
Introduces Reasoning Primitive Induction, a method that mines successful ReAct traces to cluster recurrent reasoning moves into typed pseudo-tools, outperforming the original agent by tens of percentage points on benchmarks.
View Cached Full Text
Cached at: 06/03/26, 09:42 AM
# Inducing Reasoning Primitives from Agent Traces
Source: [https://arxiv.org/html/2606.02994](https://arxiv.org/html/2606.02994)
###### Abstract
ReAct\-style LLM agents often rediscover the same reasoning routines across problems, yet leave those routines trapped in transient scratchpads\. We introduce*Reasoning Primitive Induction*, a single\-pass method that mines successful ReAct traces, clusters recurrent reasoning moves, and converts the most frequent moves into a compact library of typed pseudo\-tools\. Each pseudo\-tool is specified by a natural\-language docstring interpreted by an LLM at invocation time, and a standard ReAct loop composes these primitives at test time\. The central result is that induced libraries outperform the very agent that generated their traces: by\+44\+44pp on RuleArena NBA \(30→7430\\to 74\),\+30\+30pp onMuSRteam allocation \(38→6838\\to 68\), and\+22\+22pp on NatPlan meeting planning \(7→297\\to 29\)\. Across five comparable subtasks spanning narrative deduction, rule application, and constraint\-satisfaction planning, a single fixed configuration improves over zero\-shot Chain\-of\-Thought on every subtask, matches or surpasses expert\-authored decompositions, and outperforms AWM at lower average inference cost\.
## 1Introduction
Open\-ended LLM agents—ReActYao et al\. \([2023](https://arxiv.org/html/2606.02994#bib.bib2)\)and its descendants—reconstruct multi\-step reasoning from scratch on every instance\. Across a task family, however, their traces often contain the same latent routines: inspect a suspect’s means, check an alibi against a timeline, weigh physical evidence, or arbitrate among competing candidates\. These moves are visible in aggregate but remain unstructured at deployment time: each rollout is a fresh mixture of free\-form thoughts, actions, and observations, and the recurring patterns are discarded after the problem is solved\. The standard alternative is to hand\-author a task\-specific decomposition, but doing so requires domain insight and a new engineering pass for each benchmark family\.
We ask whether these recurring reasoning moves can be recovered automatically from the agent’s own traces\. Prior trace\-induction methods \(§[2](https://arxiv.org/html/2606.02994#S2)\) often mine completed trajectories, workflow guides, or executable skills, making the induced artifact a more reusable form of behavior the source already exhibited\. We instead induce at a finer granularity: individual reasoning moves\. The induced artifact is not a workflow to follow or an executable skill to call but the agent’s reusable reasoning vocabulary—a typed inventory of the recurring moves the agent reaches for when reasoning across instances of a task family\. The resulting primitives are exposed as typed callable pseudo\-tools, and the same agent that produced the traces can use them to outperform its original ReAct policy by tens of percentage points\. In this setting, trace induction is not merely distillation; it can surface reasoning structure that the source agent generated inconsistently but failed to deploy reliably\.
Thought 0:“I need to analyze the narrative to determine who is the most likely murderer betweenIslaandRandy\. Let me search for key evidence about both suspects\.”Thought 13:“Let me search for more specific evidence aboutRandy’salibi andIsla’salibi:”Thought 24:“Based on my analysis:Evidence against Randy:seen at the motel the day Jimmy was killed; owns a revolver, ballistics matched; strong financial motive…”⋯\\cdots⋯\\cdots\(A\)ReActsuccess rolloutInitializing suspect analysisInvestigating specific evidence typesEvaluating individual suspect evidence⋯\\cdots⋯\\cdots\(B\) Categorize & countTop\-5 canonical categories \(\|ℐ\|=196\|\\mathcal\{I\}\|=196thoughts\):investigating evidence48initializing suspects38final conclusion29summarize progress27evaluating evidence24def investigate\_evidence\( narrative: str, focus: str \) \-\> str: """ Analyze a murder narrative for evidence about ‘focus’ \(suspect, weapon, location\)\. Extract and weigh means, motive, opportunity, alibi consistency, physical evidence\. Flag hard disqualifiers and conflicting witnesses\. … """\(C\) Synthesized primitive
Figure 1:Worked example of reasoning primitive induction onMuSRmurder\.*\(A\)*Three verbatimThoughtstrings from one successfulReActrollout\.*\(B\)*Each thought is mapped to a reasoning\-move label \(Algorithm[1](https://arxiv.org/html/2606.02994#alg1)\); the bar chart shows the top\-5 canonical categories\.*\(C\)*The top category is synthesized into a typed pseudo\-tool whose body is realized by an LLM at invocation time \(full library in Appendix[A](https://arxiv.org/html/2606.02994#A1)\)\.*Reasoning primitive induction*takes one corpus of ReAct rollouts, clusters per\-stepThoughtstrings into recurring reasoning moves, and synthesizes the most frequent moves into typed Python stubs whose behavior is specified by LLM\-interpreted docstrings\. At test time, the induced library and a singlefinishaction form the agent’s action space, and an otherwise standard ReAct loop decides which primitive to invoke and in what order\. The pipeline uses two free parameters \(KK,mm\), three LLM prompts, and one fixed configuration across all benchmarks\.
Induction exceeds its source through an implicit aggregation step: for each recurring move, synthesis sees a canonical category label and representative thoughts sampled from*successful*rollouts, and writes a corpus\-level specification of what the move should accomplish\. The original ReAct agent instead makes instance\-level decisions on the fly and repeatedly reinvents the same move under local context\. Invoking a stable specification at deployment time is therefore not equivalent to asking the source agent to reproduce the move from scratch; when the source agent is high\-variance, the gap can be large\.
#### Contributions\.
We organize findings along three axes\.
\(1\) Trace induction exceeds its source\.Induced libraries outperform the source agent that produced their traces by large margins:\+44\+44pp on RuleArena NBA \(30→7430\\to 74\),\+30\+30pp onMuSRteam allocation \(38→6838\\to 68\), and\+22\+22pp on NatPlan meeting planning \(7→297\\to 29\)\. All corresponding paired\-Δ\\Deltaconfidence intervals are strictly positive \(§[4](https://arxiv.org/html/2606.02994#S4)\)\. This shows that induced\-library quality need not be tied to the source agent’s realized test\-time policy, and separates our setting from trace\-induction methods that primarily repackage complete workflows or skillsWang et al\. \([2025a](https://arxiv.org/html/2606.02994#bib.bib3),[b](https://arxiv.org/html/2606.02994#bib.bib4)\); Zheng et al\. \([2025](https://arxiv.org/html/2606.02994#bib.bib41)\)\.
\(2\) Discovery matches or surpasses expert design\.Without per\-task authoring, induced libraries significantly surpass expert\-authored decompositions onMuSRteam allocation \(\+17\+17pp\) and NatPlan meeting planning \(\+15\+15pp\), and match them on the remaining comparable subtasks \(§[4](https://arxiv.org/html/2606.02994#S4), Figure[4](https://arxiv.org/html/2606.02994#S4.F4); bootstrap CIs in Appendix[G](https://arxiv.org/html/2606.02994#A7)\)\.
\(3\) Minimal supervision, no expert tool design\.The trace source is a ReAct agent equipped only with generic search/lookup tools: no benchmark\-specific action vocabulary and no domain\-aware decomposition\. This matters for reasoning domains, where there is no obvious deterministic analogue ofverify\_alibi\_consistency\(\)orcheck\_schedule\_conflict\(\)that a developer can hand\-code\. We recover such structure*post hoc*from generic\-agent traces and obtain expert\-design\-level accuracy with minimal supervision\.
## 2Related Work
#### Trace induction for agent skills\.
The closest line of work mines successful agent traces for reusable artifacts\. AWMWang et al\. \([2025a](https://arxiv.org/html/2606.02994#bib.bib3)\)extracts natural\-language workflow guides from web\-agent traces; ASIWang et al\. \([2025b](https://arxiv.org/html/2606.02994#bib.bib4)\)mines Python skills as action\-space programs; SkillWeaverZheng et al\. \([2025](https://arxiv.org/html/2606.02994#bib.bib41)\)synthesizes APIs from a strong web agent to improve weaker downstream agents\. AFlowZhang et al\. \([2025](https://arxiv.org/html/2606.02994#bib.bib5)\)searches over workflow graphs with MCTS; TroVEWang et al\. \([2024c](https://arxiv.org/html/2606.02994#bib.bib39)\)grows a Python toolbox during programmatic\-task solving; STIRShi et al\. \([2026](https://arxiv.org/html/2606.02994#bib.bib6)\)discovers latent primitives and injects them as activation\-level steering signals\. We differ in three respects: our primitives live in*reasoning space*rather than an environment\-grounded action space; induction is*single pass*, with no online iteration, MCTS, or activation steering; and the induced atoms are typed pseudo\-tools that the agent composes freely inside a ReAct loop rather than fixed workflows\. Empirically, these choices place us in a different regime: the induced library can outperform the source agent that generated its traces \(§[4\.1](https://arxiv.org/html/2606.02994#S4.SS1)\)\.
#### Reasoning\-structure prompting and library learning\.
Self\-DiscoverZhou et al\. \([2024](https://arxiv.org/html/2606.02994#bib.bib8)\), Decomposed PromptingKhot et al\. \([2023](https://arxiv.org/html/2606.02994#bib.bib11)\), Plan\-and\-SolveWang et al\. \([2023c](https://arxiv.org/html/2606.02994#bib.bib30)\), and Tree\-of\-ThoughtsYao et al\. \([2023b](https://arxiv.org/html/2606.02994#bib.bib31)\)impose reasoning structure through human\-authored modules or generic search skeletons; RLADQu et al\. \([2025](https://arxiv.org/html/2606.02994#bib.bib28)\)learns an abstraction generator via reinforcement learning\. Classical library learning compresses solved programs into reusable abstractions \(DreamCoderEllis et al\. \([2021](https://arxiv.org/html/2606.02994#bib.bib21)\), LILOGrand et al\. \([2024](https://arxiv.org/html/2606.02994#bib.bib22)\), StitchBowers et al\. \([2023](https://arxiv.org/html/2606.02994#bib.bib23)\)\), while LLM\-era tool creation reframes the problem as tool authoring or retrieval \(ToolformerSchick et al\. \([2023](https://arxiv.org/html/2606.02994#bib.bib38)\), PALGao et al\. \([2023](https://arxiv.org/html/2606.02994#bib.bib33)\), LATMCai et al\. \([2024](https://arxiv.org/html/2606.02994#bib.bib32)\), CRAFTYuan et al\. \([2024](https://arxiv.org/html/2606.02994#bib.bib40)\)\)\. The closest design\-pattern precedent is Code as PoliciesLiang et al\. \([2023](https://arxiv.org/html/2606.02994#bib.bib34)\), which executes synthesized Python deterministically\. Our pseudo\-tools instead bind a docstring to LLM dispatch, allowing the library to encode reasoning moves such asverify\_alibi\_consistencythat do not admit clean deterministic implementations\.
#### Test\-time compute, prompt optimization, and self\-improvement\.
Self\-ConsistencyWang et al\. \([2023](https://arxiv.org/html/2606.02994#bib.bib13)\), LATSZhou et al\. \([2024b](https://arxiv.org/html/2606.02994#bib.bib16)\), and Snell et al\.Snell et al\. \([2025](https://arxiv.org/html/2606.02994#bib.bib1)\)allocate test\-time compute through sampling, search, or aggregation; we induce the available reasoning moves from data\. DSPyKhattab et al\. \([2024](https://arxiv.org/html/2606.02994#bib.bib9)\)with MIPROv2Opsahl\-Ong et al\. \([2024](https://arxiv.org/html/2606.02994#bib.bib37)\)and GEPAAgrawal et al\. \([2026](https://arxiv.org/html/2606.02994#bib.bib10)\)optimize prompts inside a fixed pipeline; we discover the modules that define the action space\. ReflexionShinn et al\. \([2023](https://arxiv.org/html/2606.02994#bib.bib25)\), ExpeLZhao et al\. \([2024](https://arxiv.org/html/2606.02994#bib.bib26)\), and STaR\-style methodsZelikman et al\. \([2022](https://arxiv.org/html/2606.02994#bib.bib27)\); Hosseini et al\. \([2024](https://arxiv.org/html/2606.02994#bib.bib29)\); Kumar et al\. \([2025](https://arxiv.org/html/2606.02994#bib.bib12)\)retrain or self\-correct the policy, whereas we leave model weights unchanged\. PTPCohen & Cohen \([2024](https://arxiv.org/html/2606.02994#bib.bib18)\), SSRMLeng et al\. \([2025](https://arxiv.org/html/2606.02994#bib.bib19)\), Faithful CoTLyu et al\. \([2023](https://arxiv.org/html/2606.02994#bib.bib35)\), and ICALSarch et al\. \([2024](https://arxiv.org/html/2606.02994#bib.bib7)\)structure individual reasoning chains for auditability; we abstract structure*across many chains*into a typed reusable inventory\. Appendix[K](https://arxiv.org/html/2606.02994#A11)gives an extended comparison\.
## 3Method: Reasoning Primitive Induction
ReActAgentSuccessfulRolloutsExtractThoughtsCategorize\+ MergeTop\-KKSynthesizePrimitiveLibraryℒ\\mathcal\{L\}input filteringinductionoutput
Figure 2:Reasoning primitive induction\.The pipeline \(Algorithm[1](https://arxiv.org/html/2606.02994#alg1)\) consumes a corpus of ReAct rollouts and emits a libraryℒ\\mathcal\{L\}in four stages: filter for correctness, extract per\-stepThoughtstrings, categorize and merge them into canonical reasoning moves via two LLM calls, and synthesize the topKKas typed primitives\. The recipe has two free parameters \(KK,mm\), three LLM prompts, and one configuration unchanged across benchmarks\.We define the artifact the pipeline produces \(§[3\.1](https://arxiv.org/html/2606.02994#S3.SS1)\), give the induction procedure \(§[3\.2](https://arxiv.org/html/2606.02994#S3.SS2)\), and discuss the design choices that keep the recipe minimal \(§[3\.3](https://arxiv.org/html/2606.02994#S3.SS3)\)\.
### 3\.1Reasoning Primitives and Pseudo\-Tools
A*reasoning primitive*is a reusable reasoning move, represented as a tuplep=\(n,σ,d\)p=\(n,\\sigma,d\)containing a name, a typed input/output signature, and a natural\-language description\. The abstraction is implementation\-agnostic\. In this paper we instantiate primitives as*pseudo\-tools*: typed callables registered in a ReAct action space whose bodies contain no handwritten task logic\. At invocation, the runtime arguments and the docstringddare formatted into a single LLM prompt; the LLM’s response is parsed into the declared output type and returned to the agent\. The signatureσ\\sigmais fixed by the task\-family prompt \(e\.g\.,\(narrative,focus\)→str\(\\texttt\{narrative\},\\texttt\{focus\}\)\\to\\texttt\{str\}forMuSR\), soσ\\sigmaconstrains the agent’s calling interface—which arguments it must supply, what type to expect back—whileddcarries the task\-specific semantics\. This split is what makes the abstraction useful for reasoning: a primitive likeverify\_alibi\_consistencyorcheck\_schedule\_conflictcan be expressed entirely as a docstring without committing to an explicit Python implementation, even when the same operation has no clean deterministic counterpart\. Reasoning primitives are the conceptual objects; pseudo\-tools are the deployment mechanism evaluated here\.
#### Composition\.
A primitive library is a setℒ=\{p1,…,pK\}\\mathcal\{L\}=\\\{p\_\{1\},\\dots,p\_\{K\}\\\}with\|ℒ\|≤K\|\\mathcal\{L\}\|\\leq K\. To deploy it, we registerℒ∪\{finish\}\\mathcal\{L\}\\cup\\\{\\textsc\{finish\}\\\}as the agent’s action space and run a standard ReAct loop\. The library defines which reasoning moves are available; the agent still decides, instance by instance, which primitives to call and in what order\. We call this setting—induced primitives plusfinish, no other tools, and no primitive descriptions copied into the system prompt—the canonical configuration\. We ablate alternative deployments in Appendix[C](https://arxiv.org/html/2606.02994#A3)\.
### 3\.2Induction Pipeline
Let𝒯=\{t1,…,tN\}\\mathcal\{T\}=\\\{t\_\{1\},\\dots,t\_\{N\}\\\}be a corpus of ReAct rollouts on training instances, where each rolloutt=\(\(Th1,A1,O1\),…,\(ThL,AL,OL\)\)t=\(\(\\textit\{Th\}\_\{1\},A\_\{1\},O\_\{1\}\),\\dots,\(\\textit\{Th\}\_\{L\},A\_\{L\},O\_\{L\}\)\)is the agent’s interleaved sequence of thoughts, actions, and observations\. The goal is to induce a libraryℒ\\mathcal\{L\}with\|ℒ\|≤K\|\\mathcal\{L\}\|\\leq Kthat can be deployed asℒ∪\{finish\}\\mathcal\{L\}\\cup\\\{\\textsc\{finish\}\\\}on held\-out instances\. Algorithm[1](https://arxiv.org/html/2606.02994#alg1)proceeds in four stages\.
*Filter\.*We retain rollouts whose final answer matches the ground\-truth label,𝒯\+=\{t:Correct\(t\)\}\\mathcal\{T\}^\{\+\}=\\\{t:\\textsc\{Correct\}\(t\)\\\}; Appendix[C](https://arxiv.org/html/2606.02994#A3)ablates this choice\.*Extract\.*We discard actions and observations and keep only per\-stepThoughtstrings, producing a flat corpusℐ\\mathcal\{I\}of natural\-language reasoning sub\-steps\.*Categorize and merge\.*We label each thought with a short reasoning\-move name \(e\.g\., “verify alibi consistency”\) using one LLM call\. If the label set contains more than ten distinct values, a second LLM call clusters synonyms into 5–10 canonical categories\. This label\-level merge is the only redundancy\-reduction step\.*Top\-KKsynthesis\.*We sort canonical categories by frequency\. For each categoryccwith support at leastmm\(defaultm=3m\{=\}3\), we sample up to five representative thoughts \(uniform random, without replacement; all available if the category has fewer than five\) and prompt an LLM to emit a primitive name and docstring\(n,d\)\(n,d\)\. Synthesis stops onceKKprimitives are produced or no remaining category meets the support threshold\. The signatureσ\\sigmais fixed; onlynnandddare generated\.
#### Why this can yield a library that exceeds its source\.
Synthesis performs an implicit aggregation step: for each category, the LLM sees several representative thoughts from successful rollouts and writes one corpus\-level specification of the move\. The source agent, by contrast, reconstructs each move on the fly under per\-instance context—it may execute the move correctly on some instances and incorrectly on others, depending on the local thought trajectory\. Calling a stable named subroutine at deployment replaces this per\-instance reconstruction with a single specification that captures*what*the move should accomplish, independent of the idiosyncratic way the source agent happened to execute it in any one trace\. This is most visible in the\+44\+44pp NBA gap \(§[4](https://arxiv.org/html/2606.02994#S4)\), where the source ReAct agent has to retrieve and apply the right CBA rule against a∼\\sim5 KB rules excerpt loaded into every prompt step—a process whose reliability varies sharply across instances; the induced library encodes the rule\-application move once and invokes it consistently\.
Algorithm 1Reasoning Primitive Induction0:Training rollouts
𝒯=\{t1,…,tN\}\\mathcal\{T\}=\\\{t\_\{1\},\\dots,t\_\{N\}\\\}from a ReAct agent on a task family
ℱ\\mathcal\{F\}; library size
KK; minimum support
mm; fixed primitive signature
σ\\sigma
0:Reasoning\-primitive library
ℒ=\{p1,…,p\|ℒ\|\}\\mathcal\{L\}=\\\{p\_\{1\},\\dots,p\_\{\|\\mathcal\{L\}\|\}\\\}with
\|ℒ\|≤K\|\\mathcal\{L\}\|\\leq K
1:
𝒯\+←\{t∈𝒯:Correct\(t\)\}\\mathcal\{T\}^\{\+\}\\leftarrow\\\{t\\in\\mathcal\{T\}:\\textsc\{Correct\}\(t\)\\\}\{filter: retain rollouts whose final answer matches the label\}
2:
ℐ←ExtractThoughts\(𝒯\+\)\\mathcal\{I\}\\leftarrow\\textsc\{ExtractThoughts\}\(\\mathcal\{T\}^\{\+\}\)\{extract: drop actions and observations\}
3:
𝒞←\[LabelMove\(i\):i∈ℐ\]\\mathcal\{C\}\\leftarrow\[\\textsc\{LabelMove\}\(i\):i\\in\\mathcal\{I\}\]\{LLM: thought→\\to3–6\-word reasoning\-move label\}
4:if
\|Unique\(𝒞\)\|\>10\|\\textsc\{Unique\}\(\\mathcal\{C\}\)\|\>10then
5:
𝒞←MergeSynonyms\(𝒞\)\\mathcal\{C\}\\leftarrow\\textsc\{MergeSynonyms\}\(\\mathcal\{C\}\)\{LLM: cluster labels into 5–10 canonical categories\}
6:endif
7:
ℒ←∅\\mathcal\{L\}\\leftarrow\\emptyset
8:for
\(c,count\)∈MostCommon\(𝒞\)\(c,\\text\{count\}\)\\in\\textsc\{MostCommon\}\(\\mathcal\{C\}\)do
9:if
\|ℒ\|≥K\|\\mathcal\{L\}\|\\geq Kor
count<m\\text\{count\}<mthen
10:break
11:endif
12:
Ec←E\_\{c\}\\leftarrowup to five thoughts of category
cc\(uniform random, without replacement\)
13:
\(n,d\)←Synthesize\(c,Ec\)\(n,d\)\\leftarrow\\textsc\{Synthesize\}\(c,E\_\{c\}\)\{LLM: emit primitive namennand docstringdd; signatureσ\\sigmais fixed\}
14:
ℒ←ℒ∪\{\(n,σ,d\)\}\\mathcal\{L\}\\leftarrow\\mathcal\{L\}\\cup\\\{\(n,\\sigma,d\)\\\}
15:endfor
16:return
ℒ\\mathcal\{L\}
### 3\.3Pipeline design choices
Four design choices keep the procedure deliberately minimal\. First, the trace\-source ReAct agent receives only generic search/lookup tools—no benchmark\-specific decomposition and no domain\-aware action vocabulary—so any recovered structure must come from the traces themselves\. Second, induction is single pass: ReAct runs once on the training split and the resulting rollouts are analyzed once, with no online iteration, curriculum, or iterative refactoring loop\. Third, we fix one configuration across benchmarks \(K=5K\{=\}5,m=3m\{=\}3, fixed prompts, with only a per\-benchmark task description varying\)\.K=5K\{=\}5is the cross\-benchmark default;K∈\{3,5,10\}K\\in\\\{3,5,10\\\}is swept in Appendix[C](https://arxiv.org/html/2606.02994#A3)\. We report this single fixed configuration rather than per\-task tuning\. Fourth, we fix the deployment variant to correct\-only trace filtering and invocation\-time descriptions \(*Variant A*\); the other corners of this factorial are evaluated in Appendix[C](https://arxiv.org/html/2606.02994#A3)\. The resulting recipe has two free parameters, three LLM prompts \(categorize / merge / synthesize\), and no benchmark\-specific code\.
## 4Main Results
Table[1](https://arxiv.org/html/2606.02994#S4.T1)reports held\-out test accuracy on six subtasks from three task families\. We compare*reasoning primitive induction*with five baselines designed to isolate different alternatives: zero\-shot prompting, the source ReAct agent, expert\-authored primitives, workflow\-memory induction, and executable program reasoning\. The experimental protocol appears in §[5](https://arxiv.org/html/2606.02994#S5); deployment and library\-size sweeps appear in Appendix[C](https://arxiv.org/html/2606.02994#A3)\.
Table 1:Cross\-benchmark main results\.Test\-set accuracy \(%\), DeepSeek\-V3\.Boldmarks the highest accuracy per column\. NatPlan trip is excluded from method comparisons because all methods score below10%10\\%\. 95% bootstrap CIs and paired\-Δ\\Deltatests appear in Appendix[G](https://arxiv.org/html/2606.02994#A7)\.### 4\.1Induction Exceeds Its Source
The clearest empirical signal is that induced libraries outperform the source agent that produced their traces \(Figure[3](https://arxiv.org/html/2606.02994#S4.F3)\)\. On RuleArena NBA, the source ReAct agent reaches30%30\\%accuracy, while a library induced from its traces reaches74%74\\%\(\+44\+44pp\)\. The corresponding gains are\+30\+30pp onMuSRteam allocation and\+22\+22pp on NatPlan meeting planning; all paired\-Δ\\Deltaconfidence intervals are strictly positive \(Appendix[G](https://arxiv.org/html/2606.02994#A7)\)\.
This is the regime in which*reasoning primitive induction*differs most sharply from workflow\- or skill\-level trace induction\. Methods such as AWMWang et al\. \([2025a](https://arxiv.org/html/2606.02994#bib.bib3)\)and ASIWang et al\. \([2025b](https://arxiv.org/html/2606.02994#bib.bib4)\)mine completed trajectories, workflow guides, or executable skills and reuse them at test time\. Our method instead targets individual reasoning moves\. Two mechanisms then compound: induction aggregates across many successful rollouts to produce a corpus\-level specification, and deployment replaces on\-the\-fly reconstruction with calls to a stable named subroutine\. When the source agent is high\-variance—as in NBA, where long rule excerpts make multi\-step reasoning brittle—this difference becomes large\.
0255075100accuracy \(%\)4575\+\+30murder5069\+\+19object3868\+\+30team3074\+𝟒𝟒\\boldsymbol\{\+44\}NBA729\+\+22meeting25\+\+3tripMuSRRuleArenaNatPlanReAct \(source agent\)Induced libraryFigure 3:Exceeding the source agent\.Test\-set accuracy of the sourceReActagent versus the induced library\. Induced libraries exceed their source by up to\+44\+44pp on RuleArena NBA,\+30\+30pp onMuSRteam allocation, and\+22\+22pp on NatPlan meeting \(all paired\-Δ\\DeltaCIs strictly positive; Appendix[G](https://arxiv.org/html/2606.02994#A7)\)\. NatPlan trip is excluded from method comparisons because all methods score below10%10\\%\.
### 4\.2Discovery matches or surpasses expert design
Induced libraries significantly surpass expert\-authored decompositions onMuSRteam allocation \(\+17\+17pp\) and NatPlan meeting planning \(\+15\+15pp\), and match them on the remaining cells; the murder, object, and NBA differences fall within 95% bootstrap confidence intervals \(Figure[4](https://arxiv.org/html/2606.02994#S4.F4); Appendix[G](https://arxiv.org/html/2606.02994#A7)\)\. Against AWM \(offline configuration; Appendix[J](https://arxiv.org/html/2606.02994#A10)\), induced libraries equal or exceed accuracy on every comparable cell and significantly improve four reported cells\. Against Program\-of\-Thoughts, they match accuracy on five of six cells and significantly improve meeting planning\. Against zero\-shot CoT, the gains are large and significant on three of the four cells where CoT is furthest from saturation \(murder\+15\+15, object\+29\+29, NBA\+26\+26pp\)\.
The hand\-designed baseline shares the typed\-pseudo\-tool decomposition with*reasoning primitive induction*but is deployed as a fixed Python pipeline \(Appendix[B](https://arxiv.org/html/2606.02994#A2)\) and embeds per\-task expert authoring by construction\.*Reasoning primitive induction*matches or surpasses this expert specification on every comparable cell, with the largest gains on the two soft\-constraint tasks where manual decomposition is hardest to enumerate\. On team allocation and meeting planning, the reasoning is not a linear parse→\\toplan→\\tocheck pipeline but a sequence of rearrangements between partial constraints, trade\-offs, and constraint violations—moves that an expert decomposition has to anticipate*a priori*, but that induction discovers directly from the traces of agents that solved the task successfully\. When the same hand\-designed primitives are instead registered as a ReAct action space \(matching*reasoning primitive induction*’s deployment\), induced libraries exceed them by\+1\+1to\+29\+29pp on every cell \(Appendix[B\.7](https://arxiv.org/html/2606.02994#A2.SS7)\); the discovery\-vs\-design gap is not driven by dispatch\.
0255075100accuracy \(%\)6975\+\+6murder5969\+\+10object5168\+𝟏𝟕\\boldsymbol\{\+17\}team7374\+\+2NBA1429\+𝟏𝟓\\boldsymbol\{\+15\}meeting45\+\+1tripMuSRRuleArenaNatPlanHand\-designedInduced libraryFigure 4:Discovery matches or surpasses expert design\.Test\-set accuracy of expert\-authored decompositions versus the induced library\. Induced libraries match or exceed the expert spec on every comparable cell; the team \(\+17\+17pp\) and meeting \(\+15\+15pp\) gains are statistically significant, while murder, object, and NBA are within 95% bootstrap CIs \(Appendix[G](https://arxiv.org/html/2606.02994#A7)\)\. Bold deltas indicate significance\.
### 4\.3Compute footprint and the more\-compute alternative
*Reasoning primitive induction*’s per\-instance compute is∼\\sim24%24\\%cheaper than its closest multi\-step competitor AWM \(∼\\sim117117K vs\.∼\\sim154154K tokens\) and an order of magnitude above single\-shot CoT/Program\-of\-Thoughts \(∼\\sim55–66K\)\. Per\-cell tokens, costs, and tool invocation rates are reported in Appendix[F](https://arxiv.org/html/2606.02994#A6)\.
#### Pure compute does not explain the CoT gap\.
A natural reading of*reasoning primitive induction*’s lift over zero\-shot CoT is that the multi\-step rollout simply spends more tokens on each instance\. To rule this out, we run a same\-model compute\-matched control: zero\-shot CoT with Self\-Consistency atN=20N\{=\}20\(majority vote over2020independent CoT samples,∼\\sim21×21\\timesCoT compute, comparable to*reasoning primitive induction*’s rollout budget\)\. On the same DeepSeek\-V3 backbone, SC@N=20@N\{=\}20reaches66\.0%66\.0\\%onMuSRmurder and45\.3%45\.3\\%on object—−9\-9pp and−24\-24pp below*reasoning primitive induction*\(Appendix[H](https://arxiv.org/html/2606.02994#A8)\)\. Twenty\-fold same\-model compute thus does not close the gap; the induced library contributes a separable lift that sampling\-based test\-time compute does not reproduce\. ReAct and AWM additionally consume comparable or larger token budgets than*reasoning primitive induction*without matching its accuracy on any cell\.
### 4\.4Multi\-model generalization
To verify that the “induction exceeds source” pattern is not an artifact of DeepSeek\-V3, we apply the same induction pipeline to Gemini Flash Lite—a smaller, weaker model from a different family—as both the trace\-source agent and the test\-time agent\. The induction algorithm, hyperparameters, and prompts are unchanged\. Table[2](https://arxiv.org/html/2606.02994#S4.T2)reports the result on twoMuSRsubtasks\.
Table 2:Self\-induction on Gemini Flash Lite\.Lite generates traces and deploys the induced library at test time, mirroring the DeepSeek\-V3 main\-results setup\. Same induction pipeline, hyperparameters, and prompts\.The same direction holds: induced libraries lift accuracy over the source ReAct agent on a smaller backbone from a different model family\. The source\-exceeding pattern reproduces across the two model families and subtasks tested here, indicating it does not depend on the specific backbone used as both inducer and target\.
## 5Experimental Setup
#### Benchmarks\.
We evaluate six subtasks from three task families\. NatPlan trip is excluded from method comparisons because all methods score below10%10\\%\.111We exclude three additional subtasks from evaluation: RuleArena*airline*\(baggage\-fee calculation\), RuleArena*tax*\(federal tax computation\), and NatPlan*calendar*\. The first two are arithmetic\-heavy domains addressed by a hybrid Python\-helper extension \(Appendix[I](https://arxiv.org/html/2606.02994#A9)\); the third is already saturated by zero\-shot CoT at86%86\\%\. Out\-of\-scope results appear in Appendix[I](https://arxiv.org/html/2606.02994#A9)\.
- •MuSRSprague et al\. \([2024](https://arxiv.org/html/2606.02994#bib.bib17)\)\(narrative deduction; 3\-way multiple choice over several\-paragraph scenarios\):*murder mystery*\(n=100n\{=\}100, identify the culprit from suspects, alibis, and evidence\),*object placement*\(n=106n\{=\}106, theory\-of\-mind belief tracking—where a character*believes*an object is after observed and unobserved moves\), and*team allocation*\(n=100n\{=\}100, soft\-preference assignment minimizing total dissatisfaction when no assignment satisfies every preference\)\.
- •RuleArena NBAZhou et al\. \([2025](https://arxiv.org/html/2606.02994#bib.bib42)\)\(rule application,n=66n\{=\}66\): determine whether a proposed roster move \(sign/trade/waive\) complies with the NBA Collective Bargaining Agreement \(loaded into the prompt as a∼\\sim5 KB excerpt\)\. Correctness is numeric within1%1\\%\.
- •NatPlanZheng et al\. \([2024](https://arxiv.org/html/2606.02994#bib.bib20)\)\(planning under constraints, free\-text plans\):*meeting planning*\(n=100n\{=\}100, schedule meetings under availability and location constraints\) and*trip planning*\(n=96n\{=\}96, multi\-leg itinerary under date/transport/stay constraints\)\.
MuSRis the standard multi\-hop reasoning testbed: non\-uniform question structure with internal contradictions \(e\.g\., no team assignment satisfies every preference\), resistant to a unified hand\-designed pipeline\.
#### Baselines\.
We compare against five baselines, each targeting a different alternative explanation:
- •Zero\-shot CoTWei et al\. \([2022](https://arxiv.org/html/2606.02994#bib.bib14)\): tests whether an agent loop is necessary at all\.
- •ReActYao et al\. \([2023](https://arxiv.org/html/2606.02994#bib.bib2)\): the trace source\. Tests whether*any*induced structuring beats the source\.
- •Hand\-designed primitives: a small typed decomposition per benchmark \(e\.g\.,parse→\\toplan→\\tocheck; 2–3 reasoning primitives\), built from the same typed pseudo\-tools as*reasoning primitive induction*but deployed as a fixed Python pipeline rather than through a ReAct loop\. Full per\-benchmark specs and dispatch rationale in Appendix[B](https://arxiv.org/html/2606.02994#A2)\. Tests whether discovery matches or surpasses expert design\.
- •Agent Workflow Memory \(AWM\)Wang et al\. \([2025a](https://arxiv.org/html/2606.02994#bib.bib3)\): our closest method\-level cousin\. Configured offline: a single static workflow per task family, induced once from the training partition \(Appendix[J](https://arxiv.org/html/2606.02994#A10)\)\. Tests whether natural\-language workflow guides match typed reasoning subroutines\.
- •Program\-of\-ThoughtsChen et al\. \([2023](https://arxiv.org/html/2606.02994#bib.bib15)\): emits and executes Python instead of natural\-language reasoning\. Tests whether code\-as\-reasoning is a stronger paradigm\.
#### Default configuration of*reasoning primitive induction*\.
Headline numbers useK=5K\{=\}5primitives synthesized at supportm=3m\{=\}3with fixed prompt templates \(only a per\-benchmark task description varies\)\. At deployment, the action space is induced library∪\{finish\}\\cup~\\\{\\textsc\{finish\}\\\}; the trace filter retains only correct rollouts; primitive descriptions surface only at invocation time \(in the tool registry, not the system prompt\)\. We call this*Variant A*; the two binary choices \(filter, descriptions\) yield four variants A–D swept in Appendix[C](https://arxiv.org/html/2606.02994#A3)alongsideK∈\{3,5,10\}K\\in\\\{3,5,10\\\}\.
#### Metrics\.
ForMuSR\(multiple\-choice\) and RuleArena NBA \(numeric\) we use exact / numeric\-tolerance match\. For NatPlan, strict\-parser exact match reads near zero due to format fragility; followingZheng et al\. \([2024](https://arxiv.org/html/2606.02994#bib.bib20)\)we report LLM\-judge accuracy and include strict\-parser numbers in Appendix[I](https://arxiv.org/html/2606.02994#A9)\. Token\-based costs are in Appendix[J](https://arxiv.org/html/2606.02994#A10)\.
#### LLM judge for NatPlan\.
NatPlan accuracy is evaluated by DeepSeek\-V3 using task\-specific factor\-based prompts \(Appendix[E](https://arxiv.org/html/2606.02994#A5)\)\. The judge receives only the predicted plan and reference plan, with no method identifier\. We cross\-validate three representative cells with Gemini Flash Lite as an independent judge: aggregate agreement across296296cases is96\.6%96\.6\\%\(Appendix[E](https://arxiv.org/html/2606.02994#A5)\)\.
#### Reporting protocol\.
Headline numbers \(Table[1](https://arxiv.org/html/2606.02994#S4.T1)\) use greedy decoding \(temperature 0\)\.
## 6Limitations
Arithmetic\-heavy tasks need an external tool\.Each pseudo\-tool is realized by an LLM interpreting its docstring at invocation time—the same property that lets the library encode reasoning moves likeverify\_alibi\_consistencyorcheck\_schedule\_conflictthat have no clean deterministic implementation\. The trade\-off is precision: on tasks dominated by long deterministic arithmetic chains \(e\.g\., RuleArena airline baggage\-fee calculation, RuleArena tax computation\), errors in any one generated step compound across the chain, and the simulate\-only configuration drops below zero\-shot CoT\. A hybrid extension that routes arithmetic through a Python helper while keeping the docstring’s natural\-language interface for extraction and rule selection recovers CoT\-level performance on airline \(Appendix[I](https://arxiv.org/html/2606.02994#A9)\); generalizing this to a full hybrid induction pipeline—automatically deciding which primitives need a deterministic body—is open\. NatPlan calendar is excluded for a separate reason: zero\-shot CoT already reaches86%86\\%, leaving little headroom for a multi\-step reasoning library\.
Cross\-family transfer is unexplored\.We induce one library per task family and evaluate it within that family\. Whether a library induced from one family aids reasoning on another—e\.g\., the murder\-mystery library transferred to team allocation, or a library trained jointly on multiple families capturing more general reasoning moves—remains open\. The current pipeline still imposes a re\-induction step when the test distribution shifts to a new task family, even though no per\-task authoring is required within a family\. A pretraining\-style “induce once on a heterogeneous trace corpus, deploy across many test families” setting is the natural next investigation\.
#### Broader impact\.
Induced primitives are typed callables with natural\-language docstrings, so the agent’s reasoning vocabulary is human\-readable and individually auditable: a deployer can inspect each primitive’s intent, edit a docstring to correct a systematic bias, or remove a primitive whose output cannot be safely trusted\. The same trace\-aggregation that recovers reasoning structure, however, also preserves systematic errors that recur across successful rollouts; if the source agent’s traces consistently encode a specific bias, that bias is likely to be synthesized into a primitive\. Deployment in high\-stakes domains should therefore pair induction with human review of the synthesized library and ongoing validation against external ground truth\.
## 7Conclusion
Induction exceeds its source\. Induced primitive libraries outperform the very agent that generated their traces by up to\+44\+44pp on RuleArena NBA and\+30\+30pp onMuSRteam allocation, surfacing reasoning structure the original policy expressed only inconsistently\. All paired\-Δ\\Deltaconfidence intervals against the source agent are strictly positive on the comparable subtasks\. The mechanism is implicit aggregation: synthesis sees representative thoughts from many successful rollouts and writes a corpus\-level specification of each recurring move, while the source agent reconstructs the move on the fly under local context—calling the stable specification at deployment is not equivalent to asking the source agent to repeat itself\. This pattern is consistent across narrative deduction, rule application, and constraint\-satisfaction planning, and reproduces on a smaller backbone \(Gemini Flash Lite\) when used as both inducer and target\. With two free parameters, three prompts, and one fixed configuration across benchmarks, induced libraries improve over zero\-shot CoT on every comparable cell, match or surpass expert\-authored decompositions, and outperform AWM at∼\\sim24%24\\%lower average inference cost\.
The trace source uses only generic search/lookup tools—no benchmark\-specific actions, no expert decomposition—so the recovered structure is not a recapitulation of supervisory signal but an emergent property of how an LLM agent reasons across instances of a task family\. Each primitive is a typed pseudo\-tool with an LLM\-interpreted docstring: inspectable, editable, and composable inside a standard ReAct loop, rather than baked into model weights or activation\-level steering vectors\.*Reasoning primitive induction*therefore offers a training\-free, single\-pass route to discovering reasoning structure that has previously required expert authoring\.
## References
- Snell et al\. \[2025\]C\. Snell, J\. Lee, K\. Xu, and A\. Kumar\.Scaling LLM test\-time compute optimally can be more effective than scaling parameters for reasoning\.In*ICLR*, 2025\.
- Yao et al\. \[2023\]S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao\.ReAct: Synergizing reasoning and acting in language models\.In*ICLR*, 2023\.
- Wang et al\. \[2025a\]Z\. Z\. Wang, J\. Mao, D\. Fried, and G\. Neubig\.Agent workflow memory\.In*ICML*, 2025\.
- Wang et al\. \[2025b\]Z\. Z\. Wang, A\. Gandhi, G\. Neubig, and D\. Fried\.Inducing programmatic skills for agentic tasks\.In*COLM*, 2025\.
- Zhang et al\. \[2025\]J\. Zhang, J\. Xiang, Z\. Yu, F\. Teng, X\.\-H\. Chen, J\. Chen, M\. Zhuge, X\. Cheng, S\. Hong, J\. Wang, B\. Zheng, B\. Liu, Y\. Luo, and C\. Wu\.AFlow: Automating agentic workflow generation\.In*ICLR \(Oral\)*, 2025\.
- Shi et al\. \[2026\]Z\. Shi, Y\. Zhu, J\. Shi, X\. Zhang, L\. Wang, and C\. Miao\.STIR: Internalizing LLM reasoning via discovery and replay of latent actions\.*arXiv preprint arXiv:2602\.04925*, 2026\.
- Sarch et al\. \[2024\]G\. Sarch, L\. Jang, M\. J\. Tarr, W\. W\. Cohen, K\. Marino, and K\. Fragkiadaki\.VLM agents generate their own memories: Distilling experience into embodied programs of thought\.In*NeurIPS*, 2024\.
- Zhou et al\. \[2024\]P\. Zhou, J\. Pujara, X\. Ren, X\. Chen, H\.\-T\. Cheng, Q\. V\. Le, E\. H\. Chi, D\. Zhou, S\. Mishra, and H\. S\. Zheng\.Self\-discover: Large language models self\-compose reasoning structures\.In*NeurIPS*, 2024\.
- Khattab et al\. \[2024\]O\. Khattab, A\. Singhvi, P\. Maheshwari, Z\. Zhang, K\. Santhanam, S\. Vardhamanan, S\. Haq, A\. Sharma, T\. T\. Joshi, H\. Moazam, H\. Miller, M\. Zaharia, and C\. Potts\.DSPy: Compiling declarative language model calls into state\-of\-the\-art pipelines\.In*ICLR*, 2024\.
- Agrawal et al\. \[2026\]L\. A\. Agrawal, S\. Tan, D\. Soylu, N\. Ziems, R\. Khare, K\. Opsahl\-Ong, A\. Singhvi, H\. Shandilya, M\. J\. Ryan, M\. Jiang, C\. Potts, K\. Sen, A\. G\. Dimakis, I\. Stoica, D\. Klein, M\. Zaharia, and O\. Khattab\.GEPA: Reflective prompt evolution can outperform reinforcement learning\.In*ICLR \(Oral\)*, 2026\.
- Khot et al\. \[2023\]T\. Khot, H\. Trivedi, M\. Finlayson, Y\. Fu, K\. Richardson, P\. Clark, and A\. Sabharwal\.Decomposed prompting: A modular approach for solving complex tasks\.In*ICLR*, 2023\.
- Kumar et al\. \[2025\]A\. Kumar, V\. Zhuang, R\. Agarwal, Y\. Su, J\. D\. Co\-Reyes, A\. Singh, K\. Baumli, S\. Iqbal, C\. Bishop, R\. Roelofs, L\. M\. Zhang, K\. McKinney, D\. Shrivastava, C\. Paduraru, G\. Tucker, D\. Precup, F\. Behbahani, and A\. Faust\.Training language models to self\-correct via reinforcement learning\.In*ICLR*, 2025\.
- Wang et al\. \[2023\]X\. Wang, J\. Wei, D\. Schuurmans, Q\. Le, S\. Narang, A\. Chowdhery, D\. Zhou, and E\. H\. Chi\.Self\-consistency improves chain of thought reasoning in language models\.In*ICLR*, 2023\.
- Wei et al\. \[2022\]J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. H\. Chi, Q\. V\. Le, and D\. Zhou\.Chain\-of\-thought prompting elicits reasoning in large language models\.In*NeurIPS*, 2022\.
- Chen et al\. \[2023\]W\. Chen, X\. Ma, X\. Wang, and W\. W\. Cohen\.Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks\.*TMLR*, 2023\.
- Zhou et al\. \[2024b\]A\. Zhou, K\. Yan, M\. Shlapentokh\-Rothman, H\. Wang, and Y\.\-X\. Wang\.Language agent tree search unifies reasoning, acting, and planning in language models\.In*ICML*, 2024\.
- Sprague et al\. \[2024\]Z\. Sprague, X\. Ye, K\. Bostrom, S\. Chaudhuri, and G\. Durrett\.MuSR: Testing the limits of chain\-of\-thought with multistep soft reasoning\.In*ICLR*, 2024\.
- Cohen & Cohen \[2024\]C\. A\. Cohen and W\. W\. Cohen\.Watch your steps: Observable and modular chains of thought\.*arXiv preprint arXiv:2409\.15359*, 2024\.
- Leng et al\. \[2025\]J\. Leng, C\. A\. Cohen, Z\. Zhang, C\. Xiong, and W\. W\. Cohen\.Semi\-structured LLM reasoners can be rigorously audited\.*arXiv preprint arXiv:2505\.24217*, 2025\.
- Zheng et al\. \[2024\]H\. S\. Zheng, S\. Mishra, H\. Zhang, X\. Chen, M\. Chen, A\. Nova, L\. Hou, H\.\-T\. Cheng, Q\. V\. Le, E\. H\. Chi, and D\. Zhou\.NATURAL PLAN: Benchmarking LLMs on natural language planning\.*arXiv preprint arXiv:2406\.04520*, 2024\.
- Ellis et al\. \[2021\]K\. Ellis, C\. Wong, M\. Nye, M\. Sablé\-Meyer, L\. Morales, L\. Hewitt, L\. Cary, A\. Solar\-Lezama, and J\. B\. Tenenbaum\.DreamCoder: Bootstrapping inductive program synthesis with wake\-sleep library learning\.In*PLDI*, 2021\.
- Grand et al\. \[2024\]G\. Grand, L\. Wong, M\. Bowers, T\. X\. Olausson, M\. Liu, J\. B\. Tenenbaum, and J\. Andreas\.LILO: Learning interpretable libraries by compressing and documenting code\.In*ICLR*, 2024\.
- Bowers et al\. \[2023\]M\. Bowers, T\. X\. Olausson, L\. Wong, G\. Grand, J\. B\. Tenenbaum, K\. Ellis, and A\. Solar\-Lezama\.Top\-down synthesis for library learning\.In*POPL*, 2023\.
- Wang et al\. \[2023b\]G\. Wang, Y\. Xie, Y\. Jiang, A\. Mandlekar, C\. Xiao, Y\. Zhu, L\. Fan, and A\. Anandkumar\.Voyager: An open\-ended embodied agent with large language models\.*arXiv preprint arXiv:2305\.16291*, 2023\.
- Shinn et al\. \[2023\]N\. Shinn, F\. Cassano, E\. Berman, A\. Gopinath, K\. Narasimhan, and S\. Yao\.Reflexion: Language agents with verbal reinforcement learning\.In*NeurIPS*, 2023\.
- Zhao et al\. \[2024\]A\. Zhao, D\. Huang, Q\. Xu, M\. Lin, Y\.\-J\. Liu, and G\. Huang\.ExpeL: LLM agents are experiential learners\.In*AAAI*, 2024\.
- Zelikman et al\. \[2022\]E\. Zelikman, Y\. Wu, J\. Mu, and N\. D\. Goodman\.STaR: Bootstrapping reasoning with reasoning\.In*NeurIPS*, 2022\.
- Qu et al\. \[2025\]Y\. Qu, A\. Singh, Y\. Lee, A\. Setlur, R\. Salakhutdinov, C\. Finn, and A\. Kumar\.RLAD: Training LLMs to discover abstractions for solving reasoning problems\.*arXiv preprint arXiv:2510\.02263*, 2025\.
- Hosseini et al\. \[2024\]A\. Hosseini, X\. Yuan, N\. Malkin, A\. Courville, A\. Sordoni, and R\. Agarwal\.V\-STaR: Training verifiers for self\-taught reasoners\.In*COLM*, 2024\.
- Wang et al\. \[2023c\]L\. Wang, W\. Xu, Y\. Lan, Z\. Hu, Y\. Lan, R\. K\.\-W\. Lee, and E\.\-P\. Lim\.Plan\-and\-solve prompting: Improving zero\-shot chain\-of\-thought reasoning by large language models\.In*ACL*, 2023\.
- Yao et al\. \[2023b\]S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. Griffiths, Y\. Cao, and K\. Narasimhan\.Tree of thoughts: Deliberate problem solving with large language models\.In*NeurIPS*, 2023\.
- Cai et al\. \[2024\]T\. Cai, X\. Wang, T\. Ma, X\. Chen, and D\. Zhou\.Large language models as tool makers\.In*ICLR*, 2024\.
- Gao et al\. \[2023\]L\. Gao, A\. Madaan, S\. Zhou, U\. Alon, P\. Liu, Y\. Yang, J\. Callan, and G\. Neubig\.PAL: Program\-aided language models\.In*ICML*, 2023\.
- Liang et al\. \[2023\]J\. Liang, W\. Huang, F\. Xia, P\. Xu, K\. Hausman, B\. Ichter, P\. Florence, and A\. Zeng\.Code as policies: Language model programs for embodied control\.In*IEEE ICRA*, 2023\.
- Lyu et al\. \[2023\]Q\. Lyu, S\. Havaldar, A\. Stein, L\. Zhang, D\. Rao, E\. Wong, M\. Apidianaki, and C\. Callison\-Burch\.Faithful chain\-of\-thought reasoning\.In*IJCNLP\-AACL*, 2023\.
- Madaan et al\. \[2023\]A\. Madaan, N\. Tandon, P\. Gupta, S\. Hallinan, L\. Gao, S\. Wiegreffe, U\. Alon, N\. Dziri, S\. Prabhumoye, Y\. Yang, S\. Gupta, B\. P\. Majumder, K\. Hermann, S\. Welleck, A\. Yazdanbakhsh, and P\. Clark\.Self\-refine: Iterative refinement with self\-feedback\.In*NeurIPS*, 2023\.
- Opsahl\-Ong et al\. \[2024\]K\. Opsahl\-Ong, M\. J\. Ryan, J\. Purtell, D\. Broman, C\. Potts, M\. Zaharia, and O\. Khattab\.Optimizing instructions and demonstrations for multi\-stage language model programs\.In*EMNLP*, 2024\.
- Schick et al\. \[2023\]T\. Schick, J\. Dwivedi\-Yu, R\. Dessì, R\. Raileanu, M\. Lomeli, E\. Hambro, L\. Zettlemoyer, N\. Cancedda, and T\. Scialom\.Toolformer: Language models can teach themselves to use tools\.In*NeurIPS*, 2023\.
- Wang et al\. \[2024c\]Z\. Wang, D\. Fried, and G\. Neubig\.TroVE: Inducing verifiable and efficient toolboxes for solving programmatic tasks\.In*ICML*, 2024\.
- Yuan et al\. \[2024\]L\. Yuan, Y\. Chen, X\. Wang, Y\. R\. Fung, H\. Peng, and H\. Ji\.CRAFT: Customizing LLMs by creating and retrieving from specialized toolsets\.In*ICLR*, 2024\.
- Zheng et al\. \[2025\]B\. Zheng, M\. Y\. Fatemi, X\. Jin, Z\. Z\. Wang, A\. Gandhi, Y\. Song, Y\. Gu, J\. Srinivasa, G\. Liu, G\. Neubig, and Y\. Su\.SkillWeaver: Web agents can self\-improve by discovering and honing skills\.*arXiv preprint arXiv:2504\.07079*, 2025\.
- Zhou et al\. \[2025\]R\. Zhou, W\. Hua, L\. Pan, S\. Cheng, X\. Wu, E\. Yu, and W\. Y\. Wang\.RuleArena: A benchmark for rule\-guided reasoning with LLMs in real\-world scenarios\.In*ACL*, 2025\.
## Appendix AInduced Primitive Library Gallery
For each in\-scope subtask we list the names, signatures, and docstring summaries of the five induced primitives produced by Algorithm[1](https://arxiv.org/html/2606.02994#alg1)atK=5K\{=\}5\. NatPlan trip is the exception: only three categories met the support thresholdm=3m\{=\}3on the training partition, so the library has size\|ℒ\|=3\|\\mathcal\{L\}\|=3\. Signatures are fixed by the synthesis prompt:\(narrative: str, focus: str\)forMuSR,\(context: str, focus: str\)for RuleArena,\(prompt: str, focus: str\)for NatPlan; all returnstr\.
### A\.1MuSRmurder
- •investigate\_evidence: analyze evidence for a focused suspect, weapon, or location across categories means / motive / opportunity / alibi consistency / physical evidence; flag hard disqualifiers and conflicting witnesses\.
- •initialize\_suspect\_analysis: identify the victim and all suspects from the narrative; collect initial clues per suspect; output a structured JSON of victim, suspects, and per\-suspect initial evidence\.
- •determine\_if\_killer: decide whether the focused suspect is the killer based on means, motive, opportunity, alibi consistency, and physical evidence; return’1’or’0’\.
- •summarize\_investigation\_progress: produce a structured per\-suspect summary of case status with confidence levels, contradictions, and recommended next investigation steps\.
- •evaluate\_suspect\_evidence: weigh evidence for or against a focused suspect; produce a categorized list \(motive / means / opportunity / physical evidence / alibi\) and an overall assessment of evidence strength\.
### A\.2MuSRobject placement
- •formulate\_search\_query: extract entities and actions from the focus phrase to construct a precise search string targeting a specific narrative segment about object movement\.
- •analyze\_character\_knowledge: track when a character was present or absent during object movements; output the character’s known locations, last\-known location, and reasoning over knowledge gaps\.
- •extract\_belief\_location: determine where a target character believes a specific object is located, accounting for theory\-of\-mind divergence between actual location and the character’s mental model\.
- •infer\_character\_belief: infer a character’s belief about object location from observations and timing of arrivals/departures relative to movements; return believed location plus step\-by\-step reasoning\.
- •track\_object\_movement\_timeline: reconstruct a chronological timeline of object movements with witnesses and visibility annotations per event\.
### A\.3MuSRteam allocation
- •analyze\_and\_summarize\_attributes: extract per\-person skills, weaknesses, and interpersonal dynamics from the narrative; bullet\-point summary highlighting hard disqualifiers and tricky trade\-offs\.
- •seek\_specific\_individual\_information: extract structured per\-person profiles \(strengths, weaknesses, characteristics, constraints\) with attention to comparative language and conflicts\.
- •define\_task\_requirements: extract task list, required skills per task, hard/soft constraints, and explicit/implicit objectives; flag disqualifying conditions\.
- •analyze\_team\_allocation: produce structured analysis of people\-tasks\-constraints suitable for downstream allocation reasoning, including inferred skills and relative strength comparisons\.
- •make\_final\_allocation\_decision: select the best assignment option index given evaluated constraints; return integer index of the chosen option\.
### A\.4RuleArena NBA
- •search\_cba\_rule: locate relevant CBA rule text matching a focused term or section; return excerpt with section reference \(e\.g\., “Article XI, Section 4\(b\)”\)\.
- •analyze\_operation\_compliance: assess whether a proposed trade, signing, or offer satisfies CBA constraints given current salary cap and team status; output compliance status, violated rules, and conditions for compliance\.
- •search\_cba\_term: case\-insensitive search for a defined CBA term, contract type, or threshold; return all matching context lines\.
- •search\_cba\_section: locate a specific section identifier \(e\.g\., “Section 8\(e\)\(1\)”\) and return its full text with surrounding context\.
- •search\_cba\_exception: locate the named exception \(e\.g\., Traded Player Exception, MLE\) and return eligibility rules, calculation method, and stacking restrictions\.
### A\.5NatPlan meeting
- •define\_meeting\_problem: parse start location/time, friends with locations / availability windows / minimum durations, and the travel\-time matrix; output structured JSON\.
- •request\_travel\_times: query the travel\-time matrix for pairs involving a focused person or location; return a list of relevant travel durations\.
- •propose\_meeting\_schedule: synthesize a time\-ordered schedule that maximizes friends met, respecting windows and travel; output sequence of arrival/departure events with travel segments\.
- •verify\_constraints: extract and validate constraints associated with a focused person or location; report constraint violations and missing information\.
- •analyze\_constraints\_and\_requirements: assess feasibility for the focused friend—earliest arrival, meeting window, minimum duration; output a feasibility verdict with timeline\.
### A\.6NatPlan trip \(\|ℒ\|=3\|\\mathcal\{L\}\|\{=\}3\)
- •define\_problem\_and\_strategy: extract required cities and stay durations, total trip length, flight adjacency, and any time\-window constraints; emit an initial scheduling strategy\.
- •search\_constraint\_details: search the prompt for constraints on a focused city, time window, or duration; return matching constraint summary\.
- •search\_flight\_network: query the flight adjacency for direct connections from the focused city or for the existence of a specific route\.
## Appendix BHand\-Designed Primitive Specifications
We list the per\-benchmark Hand\-designed specifications\. Each is a small typed\-pseudo\-tool decomposition built from the same interface as*reasoning primitive induction*\(typed interfaces with LLM\-interpreted docstrings\), but executed as a fixed Python pipeline that calls each primitive in a hand\-coded order rather than registering them in a ReAct action space\. The reasoning\-primitive counts are 3 \(murder, object, NBA, meeting, trip\) and 2 \(team allocation\); the three MuSR pipelines additionally share anextract\_indexstep that maps the chosen text to a multiple\-choice answer\. We adopt fixed\-pipeline dispatch because an expert decomposition is naturally deployed as a fixed pipeline; routing the same primitives through a ReAct loop would conflate the expert spec with the agent’s tool\-routing reliability\. Per\-step descriptions below paraphrase the actual sources\.
### B\.1MuSRmurder
- •extract\_suspects\_and\_evidence: extract victim, crime details, and per\-suspect evidence \(motive, means, opportunity, alibi claim, alibi witnesses, suspicious behavior, physical evidence\)\.
- •verify\_alibis: cross\-reference each suspect’s alibi against the narrative; identify alibi gaps, contradictions, and corroborating evidence\.
- •deduce\_murderer: synthesize evidence and verified alibis to identify the perpetrator; weight physical evidence and alibi contradictions over motive alone\.
### B\.2MuSRobject placement
- •extract\_movements: enumerate every object\-movement event chronologically with actor, source, destination, and lists of present/absent witnesses\.
- •extract\_discoveries: list incidental observations where a character learns an object’s location without witnessing the move \(saw directly, told by, inferred from event\)\.
- •infer\_belief: combine movements, discoveries, and a re\-read of the narrative to determine where the target person believes the object is located\.
### B\.3MuSRteam allocation
- •extract\_team\_requirements: extract roles and headcounts, hard/soft constraints, scoring rules, synergies, and conflicts from the narrative\.
- •score\_team\_assignments: score each candidate assignment against the requirements \(hard\-constraint disqualification, soft\-constraint scoring, scoring\-rule application\)\.
### B\.4RuleArena NBA
- •extract\_team\_operations: identify proposed signings/trades/waivers with salaries, contract types \(rookie/veteran/Bird/MLE\), and timing\.
- •find\_relevant\_cba\_rules: given the operations, identify and quote applicable CBA sections \(salary cap, luxury tax, Bird exception, rookie scale, MLE, trade match\)\.
- •check\_cba\_compliance: apply the rules to the operations and decide whether any operation violates the CBA\.
### B\.5NatPlan meeting
- •parse\_meeting\_info: extract starting location and time, friend list with locations, availability windows, and required durations, plus the travel\-time matrix\.
- •plan\_visit\_order: determine an ordering of friends that maximizes feasible meetings under windows and travel\.
- •build\_meeting\_plan: simulate the schedule step\-by\-step, format as “You start at …You travel to …You meetXXforNNminutes …”\.
### B\.6NatPlan trip
- •parse\_trip\_constraints: extract total trip days, per\-city stay durations, the flight adjacency, and any time\-window constraints\.
- •find\_valid\_route: find a city ordering where consecutive flights exist, durations fit within the budget, and time windows are satisfied\.
- •build\_trip\_plan: assign day ranges to each city, format as “Day 1–5: VisitXX/ Day 5: Fly toYY”\.
#### Comparison to induced libraries\.
The hand\-designed decompositions reflect a top\-down understanding of each task: parse→\\toplan→\\tocheck is the natural skeleton for narrative deduction, constraint\-satisfaction planning, and rule application\. Induced libraries \(Appendix[A](https://arxiv.org/html/2606.02994#A1)\) operate differently: they recover the per\-step*reasoning moves*the agent actually used \(e\.g\.,verify\_constraints,search\_cba\_exception,infer\_character\_belief\), which often span multiple of the hand\-designed boxes or invert their nesting\. The induced library tends to be flatter \(5 peer subroutines, agent\-orchestrated\) where the hand\-designed library is sequential \(2–3 steps in fixed order\)\.
### B\.7Hand\-designed primitives under ReAct dispatch
The headline Hand\-designed numbers in Table[1](https://arxiv.org/html/2606.02994#S4.T1)use the fixed Python pipeline\. To address the question of whether the discovery\-vs\-design comparison is confounded by dispatch \(fixed pipeline vs ReAct loop\), we additionally evaluate the same hand\-designed primitives registered as a ReAct action space, matching*reasoning primitive induction*’s deployment exactly\. The action space is the per\-task hand\-designed primitives∪\{finish\}\\cup~\\\{\\textsc\{finish\}\\\}, with the same per\-instance step / token budgets as*reasoning primitive induction*\.
Table 3:Hand\-designed primitives: fixed pipeline vs ReAct dispatch\.The fixed\-pipeline column reproduces the headline Hand\-designed row from Table[1](https://arxiv.org/html/2606.02994#S4.T1)\. ReAct\-dispatched registers the same primitives as the agent’s action space\.Two observations\. First, ReAct\-dispatched hand\-designed underperforms its fixed\-pipeline counterpart on most tasks \(murder−23\-23, object−3\-3, NBA−11\-11, trip−1\-1pp\), supporting our choice of the fixed pipeline as the strongest reading of expert design: rigid sequencing helps when the decomposition has a clear order \(extract→\\toverify→\\todeduce\), and ReAct’s tool\-routing introduces noise that the expert otherwise eliminates\. The two cells where ReAct dispatch helps \(MuSRteam\+16\+16, NatPlan meeting\+5\+5\) are the soft\-constraint tasks where fixed sequencing over\-commits to an ordering\. Second, induced libraries exceed hand\-designed primitives at matched dispatch on every cell \(\+1\+1to\+29\+29pp\)\. The discovery\-vs\-design gap is therefore not driven by the fixed\-pipeline\-vs\-ReAct dispatch difference; the same advantage holds when both methods deploy as ReAct action spaces\.
## Appendix CLibrary SizeKKand Deployment Factorial Sweeps
This appendix sweeps the two free parameters of the deployment configuration: library sizeKK, and the trace\-filter×\\timesdescription\-surfacing factorial that defines Variants A–D\.
#### Library sizeKK\.
Algorithm[1](https://arxiv.org/html/2606.02994#alg1)truncates the canonical\-category list at the topKKby frequency\. Table[4](https://arxiv.org/html/2606.02994#A3.T4)sweepsK∈\{3,5,10\}K\\in\\\{3,5,10\\\}on four subtasks\. The optimum is task\-dependent \(K=3K\{=\}3on murder/object,K=5K\{=\}5on team/meeting\);K=10K\{=\}10is never best and drops object\-placement by1717pp\. We retainK=5K\{=\}5as the cross\-benchmark default rather than tuneKKper task\.
Table 4:KK\-sensitivity sweep\.The optimum is task\-dependent:K=3K\{=\}3wins on murder/object,K=5K\{=\}5on team/meeting;K=10K\{=\}10is never best and reduces object\-placement accuracy by1717pp\. DefaultK=5K\{=\}5wins outright on22of44and is never substantially worst\.
#### Deployment factorial: trace filter×\\timesdescription surfacing\.
The default \(VariantA\) retains only correct rollouts and surfaces descriptions only at invocation\. Table[5](https://arxiv.org/html/2606.02994#A3.T5)sweeps the2×22\\times 2factorial onMuSR\. Filtering for correctness uniformly helps \(any→\\tocorrect:\+1\+1to\+8\+8pp\)\. Adding descriptions to the prompt interacts with the filter: it hurts on the correct\-filtered libraries \(A→\\toB:−6\-6pp murder,−5\-5pp team\) but can help on noisier libraries \(C→\\toD:\+13\+13pp team\)\. A wins outright on murder, ties on object, and has the highest per\-task mean \(70\.670\.6\); we retain it as the cross\-benchmark default\.
Table 5:Deployment factorial onMuSR\.2×22\\times 2sweep over trace filter \(any vs\. correct\-only\) and description surfacing \(off vs\. in\-prompt\)\. VariantA—correct\-only filter, descriptions surfaced only at invocation—is the headline default used elsewhere in the paper\. Bold marks the per\-task optimum\.
#### Capacity\-matched comparison to Hand\-designed\.
The Hand\-designed baseline uses 3 primitives in its fixed Python pipeline\. AtK=3K\{=\}3, induced libraries reach80\.080\.0\(MuSRmurder\) and71\.771\.7\(object\), beating Hand\-designed by\+11\.0\+11\.0pp and\+13\.2\+13\.2pp respectively\. The discovery\-vs\-design comparison reported in Figure[4](https://arxiv.org/html/2606.02994#S4.F4)is therefore not artificially favored by the induced library exposing more primitives than the Hand\-designed baseline; if anything, the induced library is more parsimonious at matched capacity\.
## Appendix DCross\-Seed Robustness onMuSR
ForMuSR, we ran all five baselines and*reasoning primitive induction*at three seeds; Table[6](https://arxiv.org/html/2606.02994#A4.T6)reports mean±\\pmsample standard deviation \(ddof=1=1\)\.
Table 6:Cross\-seed robustness onMuSR\.Mean±\\pmsample standard deviation across seeds\{42,123,456\}\\\{42,123,456\\\}\.Boldmarks the best method per column\.Three patterns emerge\. First, the rank ordering from the headline table is preserved at every cell:*reasoning primitive induction*dominates the five multi\-seed baselines on murder \(next\-best Program\-of\-Thoughts by\+7\.7\+7\.7pp\), object \(Program\-of\-Thoughts by\+6\.3\+6\.3pp\), and team \(Program\-of\-Thoughts by\+5\.0\+5\.0pp\)\. Second, the close call in the headline table where Program\-of\-Thoughts exceeded*reasoning primitive induction*on team by11pp \(6969vs6868\) inverts under multi\-seed averaging: Program\-of\-Thoughts scores69/61/6169/61/61across the three seeds while*reasoning primitive induction*scores68/66/7268/66/72, yielding a\+5\.0\+5\.0pp*reasoning primitive induction*advantage on the mean\. Third, induced libraries show consistently tight variance \(σ∈\[3\.1,4\.0\]\\sigma\\in\[3\.1,4\.0\]pp\); AWM’s variance is large on object \(±14\.1\\pm 14\.1\) and team \(±12\.5\\pm 12\.5\), and Program\-of\-Thoughts variance on murder is also wide \(±9\.1\\pm 9\.1\), where retrieval\-over\-memory or code\-generation outcomes are sensitive to which exemplars or seeds drive the prompt\. Induction over the per\-stepThoughtcorpus is more averaging\-stable\.
## Appendix ELLM\-Judge Validation
### E\.1Judge prompts
We use task\-specific factor\-based prompts\. Each prompt restricts the judge’s attention to the dimensions relevant for plan equivalence and asks for a singleyes/noverdict\.
#### Meeting\.
> You are judging whether a predicted meeting schedule achieves the same result as the correct plan\. Focus on: which friends were met, in what order, and whether the timing is feasible\. Minor formatting differences \(e\.g\. “9:00 AM” vs “9:00AM”, different sentence structure\) should be IGNORED\. What matters is: same friends met, same locations, same approximate times\. Correct plan:\{golden\} Predicted plan:\{predicted\} Does the predicted plan meet the same friends successfully? Reply with ONLY “yes” or “no”\.
#### Trip\.
> You are judging whether a predicted trip itinerary matches the correct plan\. Focus on: same cities visited, same order, same number of days in each city\. Minor formatting differences should be IGNORED\. Correct plan:\{golden\} Predicted plan:\{predicted\} Does the predicted plan visit the same cities in the same order for the same durations? Reply with ONLY “yes” or “no”\.
The judge sees only the predicted plan and the reference plan; method identity, model identity, and seed are not surfaced\.
### E\.2Cross\-judge agreement
We re\-ran the same prompts using Gemini Flash Lite as an independent judge to test for self\-evaluation bias\. The judge prompt was identical; only the model changed\. We selected three representative cells:*meeting Induced D*\(the headline NatPlan number for our method\),*trip Induced D*\(a low\-accuracy regime\), and*meeting AWM*\(a strong baseline cell, to test whether judge bias varies with method\)\.
Table 7:Cross\-judge agreement on NatPlan\.Numbers in parentheses are the YES counts under each judge\.Aggregate accuracy under both judges is identical at the cell level; the small disagreements are bidirectional \(Gemini is stricter on*trip Induced*and more lenient on*meeting AWM*\) and concentrated in the lowest\-accuracy regime\. On meeting Induced D—the cell most critical to our main claim—the two judges agree perfectly\. We retain DeepSeek\-V3 as the headline judge\.
## Appendix FCompute Footprint
We report per\-\(method, cell\) average tokens \(input\+\+output\) per test instance\. At Together AI’s DeepSeek\-V3 rate of $1\.25 per million tokens, dollar costs are a scalar transform of the token counts\.
Table 8:Tokens per instance \(input\+\+output, mean over the test set\)\.Numbers are in thousands of tokens\. NBA is the outlier across methods because each instance loads a∼\\sim5 KB CBA rules excerpt into the prompt for a multi\-step ReAct rollout\.Table 9:Mean tool invocations per instance, excludingfinish\.For multi\-step methods only; CoT and Program\-of\-Thoughts make a single LLM call per instance by construction\. ReAct and AWM counts are taken from the inner ReAct loop’s per\-step log;*reasoning primitive induction*counts are taken from its rollout, one entry per pseudo\-tool invocation\. Hand\-designed runs as a fixed Python pipeline \(no ReAct loop\) on every benchmark; the differing 0\.0 / 3\.0 / 2\.9 values reflect a per\-benchmark logging convention, not a difference in dispatch \(see paragraph below\)\.#### Reading invocation rates\.
The Hand\-designed0\.0/3\.0/2\.90\.0/3\.0/2\.9split is a per\-benchmark logging convention—MuSR records only ReAct steps, RuleArena and NatPlan record per\-primitive\-call—not a dispatch difference\.*Reasoning primitive induction*invokes the induced library44–1111times per instance, always above zero on every test instance\.
#### Reading the cost\.
NBA dominates total spend across all multi\-step methods because each instance loads the∼\\sim5 KB CBA excerpt into the prompt at every step\. Among multi\-step methods,*reasoning primitive induction*is comparable to or cheaper than AWM on every cell \(mean117117K vs154154K tokens,∼24%\{\\sim\}24\\%cheaper\)\.
#### What the cost numbers do \(and do not\) prove about the source of the gain\.
*Reasoning primitive induction*uses substantially more compute than CoT, so a natural alternative explanation for the headline gains over CoT is simply that more compute is being spent\. Two pieces of evidence rule out a pure “more compute” reading: the same\-model SC@N=20@N\{=\}20compute\-matched control on DeepSeek\-V3 lands99–2424pp below*reasoning primitive induction*onMuSRmurder and object \(Appendix[H](https://arxiv.org/html/2606.02994#A8)\); and the multi\-step baselines \(ReAct, AWM\) consume comparable or higher token budgets without matching*reasoning primitive induction*’s accuracy\.
## Appendix GBootstrap Confidence Intervals and Paired Comparisons
We compute 95% bootstrap confidence intervals on each cell of Table[1](https://arxiv.org/html/2606.02994#S4.T1)\(B==10,000 resamples with replacement, percentile method\)\. For paired comparisons \(*reasoning primitive induction*vs each baseline on each cell\), we resample case\-level indices in lockstep across the two methods to preserve per\-instance pairing, then take 95% percentiles of the resampled paired differences\.
Table 10:Per\-cell test accuracy \(%\) with 95% bootstrap CI\.Same numbers as Table[1](https://arxiv.org/html/2606.02994#S4.T1); CIs in brackets\.Table 11:PairedΔ\\Delta\(*reasoning primitive induction*−\-baseline\), 95% bootstrap CI\.Asterisk \(⋆\) marks comparisons where 0 is not contained in the 95% CI \(statistically significant atα=0\.05\\alpha\{=\}0\.05\)\.#### Reading the comparisons\.
The Induced\>\\,\>\\,ReAct gaps are uniformly significant on every cell except trip \(\+18\+18to\+44\+44pp\)\.*Reasoning primitive induction*significantly beats AWM on four cells \(murder, object, team, meeting\) and ties on NBA and trip; significantly beats Hand\-designed on team and meeting, with murder/object/NBA positive but within the CI; and significantly beats Program\-of\-Thoughts on meeting\.
## Appendix HCompute\-Matched Self\-Consistency Control
To corroborate the compute\-vs\-content discussion in §[4](https://arxiv.org/html/2606.02994#S4), we run zero\-shot Chain\-of\-Thought with Self\-Consistency atN=20N\{=\}20samples \(majority vote over 20 independent CoT rollouts\) on the same model used throughout the main paper, DeepSeek\-V3\. SC@N=20@N\{=\}20uses approximately21×21\\timesthe per\-instance compute of single\-shot CoT, comparable to*reasoning primitive induction*’s rollout budget\.
Table 12:Self\-Consistency atN=20N\{=\}20on DeepSeek\-V3\.20\-fold compute scaling on the same model does not close the gap to*reasoning primitive induction*\.Even at matched compute, SC@N=20@N\{=\}20remains99to2424pp below*reasoning primitive induction*on the two cells we ran\. Compute alone does not close the gap; the induced library’s content contributes a separable lift not reproducible by sampling\.
## Appendix IScope, Failure Modes, and Hybrid Codegen
### I\.1Out\-of\-Scope Evaluation Table
Three subtasks are excluded from evaluation by the criteria stated in §[5](https://arxiv.org/html/2606.02994#S5); their numbers appear in Table[13](https://arxiv.org/html/2606.02994#A9.T13)\.
Table 13:Scoped\-out subtasks\.Each violates one of the two scope criteria for*reasoning primitive induction*\(see §[5](https://arxiv.org/html/2606.02994#S5)\)\.†Preliminary hybrid variant \(see below\)\. Pure LLM\-interpreted induction falls to4%4\\%on a structured\-input variant\.
### I\.2Hybrid Code Generation Preview \(RuleArena Airline\)
The simulate\-only failure mode on airline has a clean diagnosis: the central reasoning step is arithmetic over a fee table, and induced primitives that simulate arithmetic via LLM dispatch compound small errors\. A hybrid extension—an induced primitive whose Python body delegates arithmetic to a deterministic helper while the LLM\-interpreted docstring handles extraction and rule selection—reaches51\.3%51\.3\\%atn=150n\{=\}150\. The oracle upper bound \(extraction fed ground truth\) is108/150=72%108/150=72\\%, so∼\\sim21pp of the gap is extraction error rather than architecture\. Hybrid code generation is the natural extension of this paper’s framework to deterministic\-subroutine domains and is left for future work\.
### I\.3NatPlan Strict\-Parser Numbers
Following the convention inZheng et al\. \[[2024](https://arxiv.org/html/2606.02994#bib.bib20)\], we report LLM\-judge accuracy throughout the main paper\. The strict parser used by the original NatPlan release reads near zero on every method on meeting and trip due to format fragility \(the parser requires an exact serialization that the model rarely produces literally even when the underlying plan is correct\)\. Strict\-parser accuracy is0\.0%0\.0\\%on every method we evaluated for both meeting and trip—no method is meaningfully separable on the strict metric\.
## Appendix JImplementation Details
#### Data splits\.
For each in\-scope subtask we form fixed train / val / test partitions at75/75/rest75/75/\\text\{rest\}via the canonical shuffle for that benchmark and a fixed shuffle seed\. Induction consumes only the training partition; the test partition is reserved for evaluation\. Test sizes arentest=100n\_\{\\text\{test\}\}=100\(MuSRmurder, team\),106106\(MuSRobject\),6666\(RuleArena NBA\),100100\(NatPlan meeting\),9696\(NatPlan trip\)\.
#### Induction prompts\.
Algorithm[1](https://arxiv.org/html/2606.02994#alg1)uses three LLM prompts, fixed across benchmarks and parameterized only by a per\-benchmark task description: \(i\)*LabelMove*maps a single thought string to a 3–6\-word reasoning\-move label; \(ii\)*MergeSynonyms*clusters labels into 5–10 canonical categories when the unique\-label count exceeds 10; \(iii\)*Synthesize*emits the function namennand docstringddfrom a category and 5 representative thoughts\. Prompt templates are released with the codebase\.
#### ReAct configuration\.
Trace generation uses a generic search\-tool ReAct: a single\-tool action space \(substring/section search over the problem context\) plus afinishaction\. We cap the agent at5050ReAct steps per instance\. No domain\-specific actions, retrieval re\-rankers, or scratchpad augmentation are used at trace\-generation time\. At evaluation time, the agent’s action space is replaced by the induced libraryℒ∪\{finish\}\\mathcal\{L\}\\cup\\\{\\textsc\{finish\}\\\}\(Variant A\)\. For NatPlan, traces are filtered using LLM\-judged correctness rather than strict\-parser equality \(consistent with the headline metric\)\.
#### Baseline implementation parity\.
All methods use the same model, decoding setting \(greedy, temperature0\), test partitions, and agent framework where applicable\. Multi\-step agent baselines \(ReAct, AWM,*reasoning primitive induction*\) share the same ReAct dispatch prompt and the same per\-instance step / token budgets, differing only in the registered tools and the system\-prompt addendum\. Program\-of\-Thoughts is single\-shot by design \(one LLM call per instance, emitting a Python program\)\. Hand\-designed uses a fixed pseudo\-tool pipeline described below\.
#### AWM baseline\.
We implement Agent Workflow MemoryWang et al\. \[[2025a](https://arxiv.org/html/2606.02994#bib.bib3)\]as follows\. Trace induction \(offline\): we collect successful ReAct rollouts on the training partition \(samentrainn\_\{\\text\{train\}\}and same trace\-source agent as for our method\), format each rollout as a compressed thought\-and\-action sequence, and prompt the LLM with a fixed AWM\-style induction prompt \(released with the codebase\) to extract a single55–1010\-step natural\-language workflow per task family\. The induction LLM is the same model as the eval\-time agent \(DeepSeek\-V3\)\. Test\-time deployment: the induced workflow is prepended to the agent’s system prompt as a guide, and the agent runs a standard ReAct loop with the basic action space \(search\+\+lookup\+\+finish\)\. AWM is configured offline: a single static workflow per task family, induced once from the training partition and held fixed at test time\.
#### Program\-of\-Thoughts baseline\.
We implement Program\-of\-ThoughtsChen et al\. \[[2023](https://arxiv.org/html/2606.02994#bib.bib15)\]as follows\. For each test instance, the LLM is prompted once to emit a Python program that, when executed, returns the answer; the program runs in a sandboxed Python interpreter\. Following the clean Chen et al\. TMLR 2023 setup, no benchmark\-specific scaffold tools are provided—the LLM relies entirely on Python built\-ins plus its own code\-generated helpers\. Same model and decoding setting as all other rows\.
#### Hand\-designed primitives baseline\.
A small typed decomposition per benchmark \(2–3 reasoning primitives; see Appendix[B](https://arxiv.org/html/2606.02994#A2)for exact counts\), authored by us using the same typed\-pseudo\-tool interface as*reasoning primitive induction*\(e\.g\.,parse→\\toplan→\\tocheckfor narrative/planning families,extract\_ops→\\tofind\_rules→\\tocheck\_compliancefor NBA\)\. Each step is a typed pseudo\-tool with a hand\-authored docstring\. Test\-time deployment differs from*reasoning primitive induction*in one respect: the primitives are dispatched by a fixed Python workflow function that calls them in the hand\-coded order, rather than being registered as a ReAct action space\. Full specifications and the rationale for this dispatch appear in Appendix[B](https://arxiv.org/html/2606.02994#A2)\.
#### Cost\.
Token\-based costs use the published Together AI rate of $1\.25 per million tokens \(in\+\+out, equal pricing\) for DeepSeek\-V3, which matches LiteLLM’s built\-in rate; we verified the LiteLLM cost column against Together’s billed amount and find agreement at the few\-percent level\. Per\-cell mean tokens and dollars are reported in Appendix[F](https://arxiv.org/html/2606.02994#A6)\.
#### Code and reproducibility\.
The full pipeline—induction prompts, trace\-generation configs, evaluation harness, and the induced libraries reported here—is publicly available at[https://github\.com/lexilei/reasoning\-primitives](https://github.com/lexilei/reasoning-primitives)\.
## Appendix KExtended Related Work
#### Workflow and skill induction from agent traces\.
Agent Workflow MemoryWang et al\. \[[2025a](https://arxiv.org/html/2606.02994#bib.bib3)\]extracts natural\-language workflow guides from web\-agent traces and retrieves them at test time, in an online setting where the memory is updated as new successful trajectories arrive\. Agent Skill InductionWang et al\. \[[2025b](https://arxiv.org/html/2606.02994#bib.bib4)\]similarly mines Python skills for browser automation, but the induced skills are action\-space programs that issue concrete environment commands rather than reasoning subroutines\. SkillWeaverZheng et al\. \[[2025](https://arxiv.org/html/2606.02994#bib.bib41)\]demonstrates a related transfer dynamic: APIs synthesized by a strong web agent uplift weaker downstream agents on WebArena\. We observe a complementary effect in a harder reasoning regime, where action grounding is absent and the source agent uses only generic search/lookup tools\. AFlowZhang et al\. \[[2025](https://arxiv.org/html/2606.02994#bib.bib5)\]performs a top\-down MCTS search over workflow graphs, treating workflow discovery as combinatorial optimization rather than as bottom\-up abstraction over traces\. TroVEWang et al\. \[[2024c](https://arxiv.org/html/2606.02994#bib.bib39)\]grows a Python toolbox while solving programmatic tasks, with periodic utility\-based trimming; our induction is single\-pass and emits LLM\-backed pseudo\-tools rather than full Python code\. STIRShi et al\. \[[2026](https://arxiv.org/html/2606.02994#bib.bib6)\]discovers latent reasoning primitives and injects them as activation\-level steering signals into the model’s hidden states at runtime, so the resulting library is implicit and embedded in trajectory\-control vectors rather than exposed as named callable units\. Our work differs along three axes simultaneously: we operate in*reasoning space*rather than environment\-grounded action space, we induce primitives in a*single pass*without online iteration, search, or activation steering, and our induced atoms are typed pseudo\-tools the agent freely composes inside a standard ReActYao et al\. \[[2023](https://arxiv.org/html/2606.02994#bib.bib2)\]loop rather than pre\-specified workflows whose control flow is fixed at induction time\. The conceptual distinction is direct: AWM/ASI/SkillWeaver reuse completed workflows or environment\-grounded skills as test\-time guides, whereas we induce abstractions of*individual reasoning moves*as composable typed pseudo\-tools that the same agent can deploy to outperform its original ReAct policy\.
#### Reasoning\-structure prompting\.
Self\-DiscoverZhou et al\. \[[2024](https://arxiv.org/html/2606.02994#bib.bib8)\]composes task\-specific reasoning structures by selecting from a fixed pool of3939human\-authored reasoning modules; because the atomic vocabulary is hand\-specified rather than learned, we exclude it from our learning\-from\-traces baselines but include it here as the closest prompt\-level analogue\. Decomposed PromptingKhot et al\. \[[2023](https://arxiv.org/html/2606.02994#bib.bib11)\]relies on human\-designed decomposers paired with bespoke sub\-task prompts, again pushing the structural burden onto the prompt engineer\. Plan\-and\-SolveWang et al\. \[[2023c](https://arxiv.org/html/2606.02994#bib.bib30)\]and Tree\-of\-ThoughtsYao et al\. \[[2023b](https://arxiv.org/html/2606.02994#bib.bib31)\]likewise prescribe a generic skeleton \(plan\-then\-execute, search\-over\-thoughts\) into which task content is poured\. RLADQu et al\. \[[2025](https://arxiv.org/html/2606.02994#bib.bib28)\]is closer in spirit because it trains models to propose problem\-specific reasoning abstractions and solve conditioned on them, but it learns an abstraction generator through reinforcement learning and produces input\-dependent textual hints\. In contrast, our method performs single\-pass post\-hoc induction from agent traces and exposes task\-level abstractions as callable pseudo\-tools in a ReAct action space\. All of these methods structure reasoning, but they either rely on human\-authored inventories or learn an abstraction\-producing policy; our induction directly discovers the inventory from trace data\.
#### Library learning and tool creation\.
The classical library\-learning paradigm in program synthesis iteratively refactors accepted solutions into a shared library of reusable abstractions: DreamCoderEllis et al\. \[[2021](https://arxiv.org/html/2606.02994#bib.bib21)\]alternates wake\-sleep phases that compress solved programs intoλ\\lambda\-calculus combinators; LILOGrand et al\. \[[2024](https://arxiv.org/html/2606.02994#bib.bib22)\]couples LLM\-guided synthesis with Stitch\-styleBowers et al\. \[[2023](https://arxiv.org/html/2606.02994#bib.bib23)\]compression\-based refactoring; VoyagerWang et al\. \[[2023b](https://arxiv.org/html/2606.02994#bib.bib24)\]grows a Minecraft skill library through curriculum\-driven exploration with self\-verification\. The LLM\-era counterpart reframes the problem as tool*creation*rather than tool*use*: ToolformerSchick et al\. \[[2023](https://arxiv.org/html/2606.02994#bib.bib38)\]teaches a model when to invoke given external APIs; PALGao et al\. \[[2023](https://arxiv.org/html/2606.02994#bib.bib33)\]offloads computation to executable Python; LATMCai et al\. \[[2024](https://arxiv.org/html/2606.02994#bib.bib32)\]introduces an explicit maker/user split where one LLM authors a Python utility for a class of tasks and another invokes it; CRAFTYuan et al\. \[[2024](https://arxiv.org/html/2606.02994#bib.bib40)\]mines reusable code snippets from training\-example solutions and equips a downstream LLM with retrieval over the resulting toolset\. The closest design\-pattern precedent for our typed\-stub design is Code as PoliciesLiang et al\. \[[2023](https://arxiv.org/html/2606.02994#bib.bib34)\], which synthesizes a Python program body from a docstring before deterministic execution\. Our pseudo\-tools make a different trade\-off: a natural\-language description is interpreted by an LLM at invocation time, which sacrifices deterministic precision but can encode reasoning moves likeverify\_alibi\_consistencythat would not admit a clean Python implementation\. Our refactor is not bottom\-up compression on syntax trees but an LLM\-driven label\-level merge over candidate primitives, and induction is single\-pass over a fixed trace corpus rather than an iterative wake\-sleep or curriculum loop\.
#### Test\-time compute structuring\.
Self\-ConsistencyWang et al\. \[[2023](https://arxiv.org/html/2606.02994#bib.bib13)\]marginalizes over independent samples; LATSZhou et al\. \[[2024b](https://arxiv.org/html/2606.02994#bib.bib16)\]runs language\-agent tree search; Snell et al\.Snell et al\. \[[2025](https://arxiv.org/html/2606.02994#bib.bib1)\]characterize compute\-optimal trade\-offs between sampling and verification\. These methods all impose a hand\-designed search or aggregation structure over the reasoning process\. Our induced library can be read as structured test\-time compute in which the structure is induced from data rather than authored\. Crucially, the library specifies which reasoning moves are*available*, not which sequence to execute; the ReAct loop chooses the sequence on a per\-instance basis, so structure and control are cleanly separated\.
#### Prompt and pipeline optimization\.
DSPyKhattab et al\. \[[2024](https://arxiv.org/html/2606.02994#bib.bib9)\]with MIPROv2Opsahl\-Ong et al\. \[[2024](https://arxiv.org/html/2606.02994#bib.bib37)\]and GEPAAgrawal et al\. \[[2026](https://arxiv.org/html/2606.02994#bib.bib10)\]optimize prompts, demonstrations, and module hyperparameters*within*a fixed pipeline graph specified by the developer\. They are powerful precisely because the pipeline is given; they do not propose new modules or alter the action space\. Our method is complementary: we discover the modules themselves\. An induced primitive is a typed callable with a docstring, which is the natural unit a DSPy module wraps, so the induced library could be plugged into a DSPy program and further optimized end\-to\-end\. We view structure discovery \(this work\) and structure tuning \(DSPy/MIPROv2/GEPA\) as orthogonal axes that compose cleanly\.
#### Trace\-based self\-improvement and auditable chains\-of\-thought\.
ReflexionShinn et al\. \[[2023](https://arxiv.org/html/2606.02994#bib.bib25)\], ExpeLZhao et al\. \[[2024](https://arxiv.org/html/2606.02994#bib.bib26)\], STaRZelikman et al\. \[[2022](https://arxiv.org/html/2606.02994#bib.bib27)\], and V\-STaRHosseini et al\. \[[2024](https://arxiv.org/html/2606.02994#bib.bib29)\]feed traces back as verbal feedback or supervised targets to retrain or self\-correct the policy; Self\-RefineMadaan et al\. \[[2023](https://arxiv.org/html/2606.02994#bib.bib36)\]iteratively applies self\-feedback within a single attempt\. SCoReKumar et al\. \[[2025](https://arxiv.org/html/2606.02994#bib.bib12)\]similarly trains the model to revise its own outputs\. ICALSarch et al\. \[[2024](https://arxiv.org/html/2606.02994#bib.bib7)\]distills VLM\-agent experience into embodied programs of thought; the artifact is closer to an action\-space program and is multimodal\. PTPCohen & Cohen \[[2024](https://arxiv.org/html/2606.02994#bib.bib18)\]and SSRMLeng et al\. \[[2025](https://arxiv.org/html/2606.02994#bib.bib19)\]impose structure on individual chains\-of\-thought to support verifiability and audit; Faithful CoTLyu et al\. \[[2023](https://arxiv.org/html/2606.02994#bib.bib35)\]pushes the same agenda through translate\-then\-solve with a deterministic backend\. Our method takes traces as input like the self\-improvement work but produces a stand\-alone callable library: we never touch model weights and we never retrain\. Whereas PTP, SSRM, and Faithful CoT impose structure*within*each chain for inspection, we abstract structure*across many chains*for reuse, yielding a typed inventory of reasoning primitives that can be invoked, audited, and edited as code\.Similar Articles
Deep Reasoning in General Purpose Agents via Structured Meta-Cognition
This paper introduces Deep Reasoning, an inference-time approach that uses structured meta-reasoning to construct task-specific scaffolds for general-purpose agents. The proposed agent, Dolores, outperforms existing methods by distributing cognition across lower-load reasoning threads, reducing hallucinations and improving performance across multiple benchmarks.
Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics
This paper introduces a method for monitoring the reasoning process of Large Reasoning Models by analyzing probe trajectories—the evolution of a concept's probability across generated tokens. The approach uses temporal and signal-processing features from hidden representations to better predict future model behavior, achieving up to 95% AUROC with max-pooling.
Tools as Continuous Flow for Evolving Agentic Reasoning
This paper introduces FlowAgent, a novel framework that reconceptualizes tool chaining as continuous trajectory generation using conditional flow matching to improve robustness in long-horizon agentic reasoning.
GraphReAct: Reasoning and Acting for Multi-step Graph Inference
This paper introduces GraphReAct, a framework that extends reasoning-acting paradigms to graph-structured data for multi-step inference. It combines topological and semantic retrieval with context refinement to improve performance on graph learning benchmarks.
ReasonOps: Operator Segmentation for LLM Reasoning Traces
ReasonOps introduces an unsupervised method for annotating chain-of-thought traces from large reasoning models, identifying 7 recurring reasoning operators. The method enables analysis of reasoning structure, model identification, and correctness prediction across 12 models and 8 benchmarks.