ReasonOps: Operator Segmentation for LLM Reasoning Traces
Summary
ReasonOps introduces an unsupervised method for annotating chain-of-thought traces from large reasoning models, identifying 7 recurring reasoning operators. The method enables analysis of reasoning structure, model identification, and correctness prediction across 12 models and 8 benchmarks.
# ReasonOps: Operator Segmentation for LLM Reasoning Traces
Source: [https://arxiv.org/html/2605.29192](https://arxiv.org/html/2605.29192)
Daniel Lee \(Stanford University, leedan@stanford\.edu\), Owen Queen \(Stanford University, oqueen@stanford\.edu\), James Zou \(Stanford University, jamesz@stanford\.edu\)
###### Abstract
Chain\-of\-thought traces from large reasoning models can span tens of thousands of tokens, yet we lack a vocabulary for describing their internal structure\. Previous methods developed to analyze chain\-of\-thought traces are either too rigid or not expressive enough, failing to capture features across domains and models\. To remedy this, we develop ReasonOps, an unsupervised, expressive method for annotating chain\-of\-thought traces, providing succinct universal operators\. Using ReasonOps, we analyze 44,662 traces from 12 thinking LLMs spanning 6 families across 8 reasoning benchmarks and discover that they share a common compositional structure: 7 recurring reasoning operators \(discourse\-level moves such as Backtracking, Inferring, and Hypothesizing\) that emerge from unsupervised clustering of sentence\-initial 3\-token pivots\. These operators appear across every model family and benchmark domain, confirmed by three independent LLM judges who classify held\-out samples at 70–76% accuracy\. We analyze the structure of operators on easy vs\. hard problems, revealing that reflective operators are more helpful on hard problems and harm performance on easy problems\. Reasoning traces are highly model\-identifying: structural operator features plus anchor\-phrase text features recover the source model with macro\-AUC = 0\.987, revealing that each model family has a distinctive reasoning fingerprint\. Structural operator features predict within\-problem answer correctness well above baselines\. Classifiers built on these operators reach WP\-AUC = 0\.701 globally and 0\.801 on AIME\. ReasonOps further enables early quality estimation well before the trace completes: we reach WP\-AUC = 0\.664 using only 50% of the trace\. The ReasonOps pipeline is unsupervised and annotation\-free, enabling deep insights into LLM reasoning traces as well as strong downstream results on model identification and correctness prediction\.
## 1 Introduction
Reasoning\-capable large language models \(LLMs\) increasingly solve difficult problems by producing long intermediate traces before emitting a final answer\. LLMs have become synonymous with "large reasoning models" \(LRMs\), as the majority of frontier models today are post\-trained to elicit extended chain\-of\-thought reasoning before producing a final answer Yang et al\. \([2025a](https://arxiv.org/html/2605.29192#bib.bib53)\); Team et al\. \([2026](https://arxiv.org/html/2605.29192#bib.bib55)\); Guo et al\. \([2025](https://arxiv.org/html/2605.29192#bib.bib43)\); Agarwal et al\. \([2025](https://arxiv.org/html/2605.29192#bib.bib56)\)\. Prompting methods, multi\-path decoding, process supervision, and reinforcement\-learning\-based post\-training have all improved performance on mathematics, science, and coding benchmarks, but they have also made reasoning traces longer, more expensive, and harder to analyze Wei et al\. \([2022](https://arxiv.org/html/2605.29192#bib.bib3)\); Lightman et al\. \([2023](https://arxiv.org/html/2605.29192#bib.bib20)\); Guo et al\. \([2025](https://arxiv.org/html/2605.29192#bib.bib43)\)\. This comes amid increasing demand for monitoring and oversight of LLM decision\-making processes Guan et al\. \([2025](https://arxiv.org/html/2605.29192#bib.bib35)\); Korbak et al\. \([2025](https://arxiv.org/html/2605.29192#bib.bib57)\); Bowman et al\. \([2022](https://arxiv.org/html/2605.29192#bib.bib58)\)\.
Since the advent of chain\-of\-thought \(CoT\) prompting Wei et al\. \([2022](https://arxiv.org/html/2605.29192#bib.bib3)\), reasoning traces have been hypothesized as a window into the problem\-solving capabilities of LLMs\. Reasoning traces, or chain\-of\-thought traces, are valuable artifacts for black\-box understanding of LLMs as they can be obtained without access to model weights\. These traces have been shown to contain rich information Shojaee et al\. \([2025](https://arxiv.org/html/2605.29192#bib.bib45)\); Kim et al\. \([2026](https://arxiv.org/html/2605.29192#bib.bib46)\), yet we lack a uniform vocabulary under which to characterize reasoning traces across models, domains, and datasets\. While some methods have attempted to provide systematic annotations for reasoning traces, these methods often rely on *a priori* vocabularies and syntactic structures that are overfit to current models or domains Yang et al\. \([2025b](https://arxiv.org/html/2605.29192#bib.bib42)\); Venhoff et al\. \([2025](https://arxiv.org/html/2605.29192#bib.bib44)\); Lee et al\. \([2025](https://arxiv.org/html/2605.29192#bib.bib41)\); Bogdan et al\. \([2025](https://arxiv.org/html/2605.29192#bib.bib36)\); Li et al\. \([2025](https://arxiv.org/html/2605.29192#bib.bib40)\)\. In this paper, we ask: can we compress diverse traces into a small vocabulary of reasoning operators shared across model families, tasks, and domains while preserving predictive information about success and failure?
Our contributions\. We introduce ReasonOps, an unsupervised framework for inducing a compact vocabulary of reasoning operators from visible chain\-of\-thought traces\. We use sentence\-initial 3\-token pivots, frequency filtering, and semantic clustering with e5\-small \(Wang et al\., [2022](https://arxiv.org/html/2605.29192#bib.bib34)\)\. We show that the resulting 7 operators are quantitatively meaningful: they generalize across 12 thinking LLMs from 6 families and 8 benchmarks, confirmed by three independent LLM judges \(70–76% classification accuracy; chance: 14%\)\. Structural operator features plus anchor\-phrase text features identify the source model with macro\-AUC = 0\.987, revealing model\-family reasoning fingerprints\. We show that operator features predict within\-problem correctness above all span\-free baselines and the LLM self\-judge \(the Operator Sequence Transformer, OST, reaches 0\.701 cross\-dataset, matching the content\-augmented Op\-XGB classifier while reading only operator labels\), and that the OST, trained once on full sequences, enables correctness estimation from partial traces at any depth, surpassing an Op\-XGB upper bound retrained per depth\. We open\-source our codebase as a resource for the community\. Code: [https://github\.com/lee\-dan/ReasonOps](https://github.com/lee-dan/ReasonOps)
Figure 1: \(Left\) Representative operator units extracted from AIME, GPQA, and LiveCodeBench\. \(Right\) Representative reasoning trace extracted from R1\-Distill\-LLaMA\-70B on the MATH dataset\.
## 2 Related Work
Annotating and analyzing reasoning traces\. The DeepSeek\-R1 authors noted that "Wait" phrases were an emergent behavior associated with increases in reasoning capability Guo et al\. \([2025](https://arxiv.org/html/2605.29192#bib.bib43)\)\. This observation, together with the interpretability of chain\-of\-thought traces, inspired work on annotating and describing reasoning traces\. However, previous works primarily rely on *a priori* vocabularies of reasoning Yang et al\. \([2025b](https://arxiv.org/html/2605.29192#bib.bib42)\); Venhoff et al\. \([2025](https://arxiv.org/html/2605.29192#bib.bib44)\); Kargupta et al\. \([2025](https://arxiv.org/html/2605.29192#bib.bib61)\), operate on arbitrary syntactic features such as sentences Lee et al\. \([2025](https://arxiv.org/html/2605.29192#bib.bib41)\); Bogdan et al\. \([2025](https://arxiv.org/html/2605.29192#bib.bib36)\), or rely on domain\-specific vocabularies Li et al\. \([2025](https://arxiv.org/html/2605.29192#bib.bib40)\)\. ReasonOps is an unsupervised method that does not assume *a priori* syntactic features or a fixed vocabulary of annotations\. Other works have included *ad hoc* methods to analyze reasoning traces for particular hypotheses Shojaee et al\. \([2025](https://arxiv.org/html/2605.29192#bib.bib45)\); Kim et al\. \([2026](https://arxiv.org/html/2605.29192#bib.bib46)\)\.
Monitoring and scalable oversight\. The scalable oversight community has extensively explored ways to characterize LLM behavior Bowman et al\. \([2022](https://arxiv.org/html/2605.29192#bib.bib58)\); Kenton et al\. \([2024](https://arxiv.org/html/2605.29192#bib.bib60)\); Engels et al\. \([2026](https://arxiv.org/html/2605.29192#bib.bib59)\), and chain\-of\-thought monitoring has been explored as a prominent direction Korbak et al\. \([2025](https://arxiv.org/html/2605.29192#bib.bib57)\); Guan et al\. \([2025](https://arxiv.org/html/2605.29192#bib.bib35)\)\. This work is tangential to ReasonOps in terms of reasoning trace characterization\. Some work has also shown that chain\-of\-thought can be an unfaithful representation of model behavior Lanham et al\. \([2023](https://arxiv.org/html/2605.29192#bib.bib48)\), but recent work has attempted to improve this in frontier models Guan et al\. \([2025](https://arxiv.org/html/2605.29192#bib.bib35)\); Paul et al\. \([2024](https://arxiv.org/html/2605.29192#bib.bib49)\)\. Our work occupies a complementary point: we treat visible traces as behaviorally meaningful artifacts and ask whether they admit a stable, annotation\-free meso\-scale abstraction\.
Correctness prediction\. Some methods have been proposed for early correctness prediction, including LLM\-only methods Miao et al\. \([2024](https://arxiv.org/html/2605.29192#bib.bib37)\); Xiang et al\. \([2026](https://arxiv.org/html/2605.29192#bib.bib38)\), and mechanistic interpretability techniques Zhao et al\. \([2026](https://arxiv.org/html/2605.29192#bib.bib39)\)\. Similar to this line of work is that of process reward models Lightman et al\. \([2023](https://arxiv.org/html/2605.29192#bib.bib20)\) and outcome reward models Cobbe et al\. \([2021](https://arxiv.org/html/2605.29192#bib.bib64)\), including work on verifiers in test\-time scaling Zhang et al\. \([2025](https://arxiv.org/html/2605.29192#bib.bib69)\); Hosseini et al\. \([2024](https://arxiv.org/html/2605.29192#bib.bib70)\)\. This work has been particularly prolific in mathematics problems Cobbe et al\. \([2021](https://arxiv.org/html/2605.29192#bib.bib64)\); Lightman et al\. \([2023](https://arxiv.org/html/2605.29192#bib.bib20)\); Wang et al\. \([2024a](https://arxiv.org/html/2605.29192#bib.bib65)\)\. Other work has explored test\-time scaling techniques for self\-verification, including SelfCheck Miao et al\. \([2024](https://arxiv.org/html/2605.29192#bib.bib37)\) and others Wang et al\. \([2023](https://arxiv.org/html/2605.29192#bib.bib4)\); Brown et al\. \([2024](https://arxiv.org/html/2605.29192#bib.bib71)\); Snell et al\. \([2024](https://arxiv.org/html/2605.29192#bib.bib47)\)\.
## 3 Methods
We now describe how we build an unsupervised mechanism to infer operators from reasoning traces across 12 LLMs\. Our pipeline is driven by linguistic principles and requires no *a priori* definition of an operator vocabulary\.
Data collection\. We collect reasoning traces from 12 thinking language models spanning 6 families \(Table [4](https://arxiv.org/html/2605.29192#A1.T4) in Appendix [A](https://arxiv.org/html/2605.29192#A1)\) on 8 benchmarks: MATH\-500 \(Hendrycks et al\., [2021](https://arxiv.org/html/2605.29192#bib.bib14)\), GPQA Diamond \(Rein et al\., [2024](https://arxiv.org/html/2605.29192#bib.bib15)\), AIME 2024, LiveCodeBench \(Jain et al\., [2024](https://arxiv.org/html/2605.29192#bib.bib16)\), HumanEval Chen et al\. \([2021](https://arxiv.org/html/2605.29192#bib.bib50)\), MMLU\-Pro Wang et al\. \([2024b](https://arxiv.org/html/2605.29192#bib.bib51)\), ARC\-Challenge \(Clark et al\., [2018](https://arxiv.org/html/2605.29192#bib.bib1)\), and BIG\-Bench Hard Suzgun et al\. \([2023](https://arxiv.org/html/2605.29192#bib.bib52)\)\. All models are queried with a 65,536\-token context budget\. Raw chain\-of\-thought tokens are collected via the OpenRouter API for all models except Claude, which is queried directly via the Anthropic API with extended thinking enabled; per\-model API details are given in Appendix [A](https://arxiv.org/html/2605.29192#A1)\. After removing truncated and malformed traces, we retain 44,662 traces \($\approx$3,720 per model\)\.
Pivot\-based span segmentation\. Each trace is segmented into *spans*\. A span begins at a *pivot sentence*: any sentence whose first three alphabetic tokens \(the *sentence start*\) appear frequently enough across the corpus to be recognized as a discourse signal\. Formally, $\mathrm{pivot}(s) = (w_1, w_2, w_3)$, where $w_i$ are the ordered lowercase alphabetic tokens at the start of sentence $s$\. A discourse pivot $p$ is included if \(1\) it appears in $\geq 100$ distinct traces \(frequency filter\), \(2\) it appears in traces from $\geq 3$ distinct datasets \(domain\-diversity filter\), and \(3\) all three tokens $w_i$ belong to the top\-2,000 most frequent corpus tokens \(vocabulary filter\)\.
These thresholds are set to balance specificity and generality\. The frequency threshold of 100 traces is chosen to exclude idiosyncratic phrases that appear in only a handful of traces while retaining common discourse moves; lowering it to 50 adds many domain\-specific phrases that are not discourse\-functional\. The domain\-diversity filter of 3 datasets ensures that any accepted pivot spans multiple task types, preventing dataset\-specific phrases from entering the vocabulary\. The top\-2,000 vocabulary filter ensures pivots are composed of common English function words rather than domain\-specific content tokens \(e\.g\., chemistry formulae, programming keywords\); we verified that varying this between 1,500 and 3,000 does not qualitatively change the resulting clusters\.
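As an illustration, the three filters can be sketched in a few lines of Python\. This is a minimal sketch, not the released implementation: the function names, the regex tokenizer, and the naive sentence splitter are all assumptions made for the example, and only the three thresholds come from the text above\.

```python
import re
from collections import Counter, defaultdict

def sentence_pivot(sentence):
    """First three lowercase alphabetic tokens of a sentence, or None if fewer exist."""
    tokens = re.findall(r"[a-z]+", sentence.lower())  # assumed tokenizer
    return tuple(tokens[:3]) if len(tokens) >= 3 else None

def split_sentences(text):
    # Naive splitter (assumption): break after ., ?, ! or at newlines.
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+|\n+", text) if s.strip()]

def accepted_pivots(traces, min_traces=100, min_datasets=3, top_vocab=2000):
    """Apply the frequency, domain-diversity, and vocabulary filters.

    traces: iterable of (trace_id, dataset_name, text) tuples.
    """
    token_counts = Counter()
    traces_with = defaultdict(set)    # pivot -> ids of traces containing it
    datasets_with = defaultdict(set)  # pivot -> datasets containing it
    for trace_id, dataset, text in traces:
        token_counts.update(re.findall(r"[a-z]+", text.lower()))
        for sentence in split_sentences(text):
            pivot = sentence_pivot(sentence)
            if pivot is not None:
                traces_with[pivot].add(trace_id)
                datasets_with[pivot].add(dataset)
    vocab = {tok for tok, _ in token_counts.most_common(top_vocab)}
    return {
        p for p in traces_with
        if len(traces_with[p]) >= min_traces        # (1) frequency filter
        and len(datasets_with[p]) >= min_datasets   # (2) domain-diversity filter
        and all(tok in vocab for tok in p)          # (3) vocabulary filter
    }
```

Counting distinct traces rather than raw occurrences means a single long trace that repeats a phrase many times cannot push it over the frequency threshold on its own\.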
Semantic pivot embedding and clustering\. We embed all discourse pivots using the `intfloat/e5-small-v2` sentence encoder \(Wang et al\., [2022](https://arxiv.org/html/2605.29192#bib.bib34)\)\. Sentence\-level embeddings capture full\-phrase meaning, correctly separating pivots such as “let me think” and “let me verify” that share a common prefix\. We cluster pivot embeddings with $K$\-means \($K \in \{6, \ldots, 11\}$, 30 restarts\), selecting $K$ by maximizing Cohen’s $\kappa$ against an independent LLM judge \(Claude Sonnet 4\.6\) on held\-out spans; $K = 7$ maximizes $\kappa = 0.693$ \(Appendix [C](https://arxiv.org/html/2605.29192#A3)\)\.
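The selection criterion can be made concrete with a small Python sketch of Cohen's $\kappa$ and the sweep over $K$\. The helper names and the input layout \(one list of cluster labels and one list of judge labels per candidate $K$\) are assumptions for illustration; the embedding and $K$\-means steps are omitted here\.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from the two marginal label distributions.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both raters used one identical label
    return (observed - expected) / (1.0 - expected)

def select_k(cluster_labels_by_k, judge_labels_by_k):
    """Return (best_k, scores), where best_k maximizes kappa against the judge.

    Both arguments map a candidate K to labels for the same held-out spans
    (hypothetical layout chosen for this example).
    """
    scores = {
        k: cohens_kappa(cluster_labels_by_k[k], judge_labels_by_k[k])
        for k in cluster_labels_by_k
    }
    return max(scores, key=scores.get), scores
```

Unlike raw agreement, $\kappa$ corrects for chance agreement, which grows as the number of labels shrinks; this keeps scores comparable across different values of $K$\.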
Independence from correctness\. The operator discovery pipeline uses no correctness labels: pivots are filtered by frequency and domain diversity, embeddings are fit on pivot text alone, and cluster assignments are by nearest\-centroid lookup\. Correctness labels enter only in the downstream prediction step \(§[6](https://arxiv.org/html/2605.29192#S6)\), where Op\-XGB and the OST are trained in a standard supervised 5\-fold CV protocol\. The operators themselves are thus a fully unsupervised intermediate representation\.
Computational cost\. The discovery pipeline is lightweight\. Pivot extraction \(frequency counting over all sentence starts\) takes under 3 minutes on a single CPU core for 44,662 traces\. Embedding the 5,464 accepted pivots with `e5-small-v2` takes 6 seconds; $K$\-means on 5,464 points takes 12 seconds\. Annotating a new trace requires only a dictionary lookup per span \(no embedding at inference time\), adding $<1$ ms per trace\. End\-to\-end operator discovery over the full corpus takes under 5 minutes, and annotating the entire 44K\-trace corpus takes under 2 minutes\.
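The dictionary\-lookup annotation step might look like the following Python sketch\. The lookup table below is a toy, hypothetical mapping \(the real pivot\-to\-operator assignments come from the clustering step\), and dropping sentences that precede the first pivot is an assumption of this example rather than documented behavior\.

```python
import re

def sentence_pivot(sentence):
    """First three lowercase alphabetic tokens of a sentence, or None."""
    tokens = re.findall(r"[a-z]+", sentence.lower())  # assumed tokenizer
    return tuple(tokens[:3]) if len(tokens) >= 3 else None

def annotate(sentences, pivot_to_operator):
    """Group sentences into (operator, [sentences]) spans via dictionary lookup.

    A sentence whose pivot is in the lookup table opens a new span; any other
    sentence is appended to the current span. Sentences before the first pivot
    are dropped (assumption made for this sketch).
    """
    spans = []
    for sentence in sentences:
        operator = pivot_to_operator.get(sentence_pivot(sentence))
        if operator is not None:
            spans.append((operator, [sentence]))
        elif spans:
            spans[-1][1].append(sentence)
    return spans

# Hypothetical lookup table, for illustration only.
TOY_LOOKUP = {
    ("wait", "let", "me"): "Backtracking",
    ("so", "the", "answer"): "Inferring",
}
```

For example, `annotate(["Wait, let me recheck.", "2 + 2 = 4.", "So the answer is 4."], TOY_LOOKUP)` yields a two\-sentence Backtracking span followed by a one\-sentence Inferring span\.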
## 4 Reasoning operators
Table 1: The 7 discovered reasoning operators\. Representative pivots are the most frequent 3\-tuples in each cluster\.
Seven operators emerge from unsupervised clustering \(Table [1](https://arxiv.org/html/2605.29192#S4.T1); Figure [1](https://arxiv.org/html/2605.29192#S1.F1)\)\. They span the space of discourse moves in extended reasoning: goal\-setting, fact\-anchoring, inference\-drawing, exploration, hedging, self\-correction, and constraint\-specification\. As a coarser classification, operators can be divided into committal operators \(Initiating, Inferring, Constraining, Grounding\) and reflective operators \(Qualifying, Backtracking, Hypothesizing\)\. These operators recur across all 12 models and all 8 benchmarks, suggesting that extended reasoning, regardless of training data, architecture, or problem domain, organizes itself into a common compositional vocabulary\. Figure [1](https://arxiv.org/html/2605.29192#S1.F1) shows representative span excerpts per operator across three of the eight benchmarks\. An illustration of a full operator sequence is given in Figure [1](https://arxiv.org/html/2605.29192#S1.F1) and Appendix [J](https://arxiv.org/html/2605.29192#A10)\.
Evaluation with an LLM judge\. We first test whether the discovered operators are semantically coherent from the perspective of independent judges\. Each LLM judge receives only cluster names, descriptions, and the top 12 representative pivot phrases per cluster, with no training spans\. It labels 50 held\-out spans per cluster \($n = 350$ total\)\. Table [2](https://arxiv.org/html/2605.29192#S4.T2) reports classification accuracy for three judges\. All reach 70–76% on the 7\-way classification task, 4\.9–5\.3$\times$ above the 14\.3% chance baseline\. Adding 3 example spans per cluster to the prompt raises accuracy from 69\.8% to 74\.6%: few\-shot examples aid pattern\-matching from a closed 7\-label set\.
Table 2: LLM\-judge validation \($n = 350$ spans, 50 per cluster\)\. Judges receive cluster names, descriptions, and top\-12 representative pivots\. Exemplars: prompt includes 3 example spans per cluster\. The $K$\-sweep is in Appendix [C](https://arxiv.org/html/2605.29192#A3)\.
Cluster naming stability\. To assess the semantic stability of the clusters, we re\-ran the LLM naming step with 30 different random seeds for exemplar sampling\. For each name and resulting description, we computed pairwise cosine similarity of description embeddings across the seeds\. The resulting descriptions were highly consistent within clusters, with a median pairwise cosine similarity of 0\.660, substantially above both an across\-cluster baseline of 0\.467 \($r_{\mathrm{rb}} = +0.72$\) and a matched random\-group control of 0\.565 \($r_{\mathrm{rb}} = +0.39$; $p < 10^{-149}$\)\. All seven operators showed significant within\-cluster stability, indicating that the naming procedure recovers coherent and reproducible operator semantics; by contrast, randomly grouped spans did not elicit comparably coherent names\. See Appendix [K\.1](https://arxiv.org/html/2605.29192#A11.SS1) for more details\.
## 5 Analysis of operator distributions
Figure 2: Distribution of operators across datasets \(top\) and models \(bottom\)\. Metrics are percent usage = average appearance of that operator for that given model/dataset vs\. all other operators\.
Operators show distinct patterns when stratified by dataset and model \(Figure [2](https://arxiv.org/html/2605.29192#S5.F2)\)\. Mathematical reasoning benchmarks \(AIME, MATH\) show higher Grounding and Constraining \(step\-by\-step derivation\), while commonsense benchmarks \(ARC\-C, BBH\) use more Grounding and elevated Hypothesizing\. Challenging multiple\-choice datasets like GPQA and MMLU\-Pro exhibit higher use of Hypothesizing and Qualifying than any other dataset\. Coding benchmarks \(LiveCodeBench, HumanEval\) exhibit elevated Constraining, consistent with specification\-first programming\. Models further show operator usage fingerprints that roughly cluster with model families, a phenomenon that’s supported by the results in §[6\.1](https://arxiv.org/html/2605.29192#S6.SS1)\. The two R1\-Distill model profiles are very similar, and Qwen models show almost identical operator proportions\. Kimi/Moonshot models show high Constraining usage\. Claude Haiku, Claude Sonnet, and Grok\-3\-Mini use excessive Grounding\. Despite these differences, all 7 operators are represented in every model\.
Transition structure\. The operator transition matrix reveals Markov structure in reasoning: Grounding strongly self\-transitions \(sustained fact\-anchoring\), while Backtracking most often leads into Initiating \(correction followed by a fresh attempt\)\. Hypothesizing frequently precedes Inferring, consistent with a generate\-and\-test schema\. Run\-length analysis reinforces this: Grounding and Constraining accumulate long self\-runs, while Backtracking is almost always a single span, a brief interruption rather than a sustained state \(Figure [11](https://arxiv.org/html/2605.29192#A12.F11), Appendix [L](https://arxiv.org/html/2605.29192#A12)\)\. We expand on this Backtracking phenomenon in §[5\.2](https://arxiv.org/html/2605.29192#S5.SS2)\.
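Both statistics are simple to compute from operator label sequences\. The Python sketch below shows one way to do it, assuming each trace has already been reduced to a list of operator names; the function names are illustrative and not taken from the released code\.

```python
from collections import Counter, defaultdict
from itertools import groupby

def transition_probs(sequences):
    """Row-normalized first-order transition matrix: probs[prev][curr]."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, curr in zip(seq, seq[1:]):
            counts[prev][curr] += 1
    return {
        prev: {curr: c / sum(nxt.values()) for curr, c in nxt.items()}
        for prev, nxt in counts.items()
    }

def run_lengths(sequences):
    """Lengths of maximal runs of the same operator, keyed by operator."""
    runs = defaultdict(list)
    for seq in sequences:
        for op, group in groupby(seq):
            runs[op].append(sum(1 for _ in group))
    return dict(runs)
```

A large diagonal entry `probs[op][op]` and long entries in `run_lengths` both indicate a sustained state, whereas an operator whose runs are almost all of length 1 behaves like a brief interruption\.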
Figure 3: Shift of operator usage by benchmark in correct and incorrect traces\. % usage difference = percent usage \(Figure [2](https://arxiv.org/html/2605.29192#S5.F2)\) in correct traces minus percent usage in incorrect traces\.
### 5\.1 Operator usage on correct vs\. incorrect traces
We see a few trends emerge in terms of operator usage on correct vs\. incorrect traces \(Figure [3](https://arxiv.org/html/2605.29192#S5.F3)\)\. First, committal operators are more common in correct traces than in incorrect traces, with Grounding, Inferring, Initiating, and Constraining all being positive in correct \- incorrect usage\. Conversely, reflective operators are globally seen more often in incorrect traces than in correct traces\. Hypothesizing and Qualifying show particularly higher usage in incorrect traces than in correct traces\. This may indicate that correct traces rely on more confident, forward\-stepping operations rather than on reflection\. This led us to investigate these dynamics on easy vs\. hard problems\.
Easy vs\. hard problems\. We next define a problem as “hard” if fewer than 1/3 of models solve it, and “easy” if more than 2/3 of models solve it\. We first filter to only BBH, LiveCodeBench, MMLU\-Pro, and GPQA, which have sufficient samples of hard problems\. For easy problems, correct traces exhibit significantly higher usage of committal operators than reflective operators, indicating confident forward\-stepping through the problem\. We define the committal\-reflective gap as committal operator usage minus reflective operator usage\. On easy problems, this gap is much higher in correct traces \(\+44\.2%\) than in incorrect traces \(\+7\.5%\) and is strongly predictive of correctness \(AUC = 0\.76, 95% CI \[0\.74, 0\.78\]\)\. On average across all problems, committal usage peaks early and late in traces while reflective operators appear mid\-trace \(Figure [4](https://arxiv.org/html/2605.29192#S5.F4)a\)\.
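A per\-trace version of the gap can be written as a short Python function\. This is a sketch under one assumption: that "usage" means the share of a trace's spans carrying a given operator\. The committal and reflective groupings are those given in §4; how per\-trace values are aggregated into the reported percentages is not specified here\.

```python
COMMITTAL = {"Initiating", "Inferring", "Constraining", "Grounding"}
REFLECTIVE = {"Qualifying", "Backtracking", "Hypothesizing"}

def committal_reflective_gap(operators):
    """Committal share minus reflective share of spans, in percentage points."""
    if not operators:
        raise ValueError("trace has no operator spans")
    committal = sum(op in COMMITTAL for op in operators)
    reflective = sum(op in REFLECTIVE for op in operators)
    return 100.0 * (committal - reflective) / len(operators)
```

A trace with three committal spans and one reflective span has a gap of \+50 percentage points; a gap of 0 means the two groups are used equally often\.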
Hard problems show less obvious differences between correct and incorrect traces\. The direction of the difference flips: the committal\-reflective gap is \+30\.4% on correct traces and \+34\.9% on incorrect traces, a −4\.5% shift that is not statistically significant \(Mann\-Whitney $p = 0.15$\) \(Figure [4](https://arxiv.org/html/2605.29192#S5.F4)a\)\. The predictive signal disappears: the AUC for separating correct from incorrect traces via the committal\-reflective gap alone drops to 0\.48 \[0\.44, 0\.51\]\. However, examining Hypothesizing vs\. Inferring, instead of all committal and reflective operators, yields some effects\. When examining the distribution of these operators over time, the Hypothesizing\-Inferring gap is particularly representative for correct vs\. incorrect hard problems\. This gap increases monotonically over the length of the trace, reaching 6\.6% at the tail of the trace, with incorrect traces showing much higher usage of Inferring over Hypothesizing \(Figure [4](https://arxiv.org/html/2605.29192#S5.F4)b\)\. This suggests that reflective operators, specifically Hypothesizing, might provide some benefit on hard problems while harming performance on easy problems\.
Figure 4: \(a\) Temporal distribution of committal \- reflective operators for correct and incorrect problems\. 95% CI shown at each normalized position\. \(b\) Hypothesizing\-Inferring gap over only hard problems for correct and incorrect traces\. Shaded regions show the gap between correct and incorrect curves\.
### 5\.2 Backtracking is a local operator
We centered the next analysis around Backtracking and Hypothesizing\. Both operators semantically appear to interrupt a model’s current line of reasoning, but we asked whether these operators actually reflected global strategic shifts in the model’s problem\-solving\. We used Claude Sonnet 4\.6 as an LLM judge to classify a stratified sample of GPQA events as Local \(re\-checks a single calculation; method unchanged\), Sub\-Problem \(re\-opens a specific case\), or Global \(abandons the current strategy\), prompting the judge with the full reasoning trace and the event marked from the operator span to its next same\-operator span\. Local backtracking is explicitly defined as “The model is re\-checking a single calculation, fact lookup, or specific claim…”, and a global event is defined as “model abandons or proposes to replace its current solution…”\. Backtracking and Hypothesizing are overwhelmingly local revisions rather than strategy changes, with statistically indistinguishable scope distributions: Backtracking is 85\.4% Local / 12\.8% Sub\-Problem / 1\.6% Global \($n = 1{,}294$\), and Hypothesizing is 80\.2% / 18\.2% / 1\.6% \($n = 192$\)\. This is supported by literature showing that backtracking is often superficial Wang et al\. \([2026](https://arxiv.org/html/2605.29192#bib.bib63)\) and rarely changes a model’s answer Kang et al\. \([2025](https://arxiv.org/html/2605.29192#bib.bib62)\)\. More details are in Appendix [H](https://arxiv.org/html/2605.29192#A8)\.
## 6 Downstream Applications
We now demonstrate how operators discovered by ReasonOps can be used to predict downstream properties of reasoning traces, namely model identity and correctness of the trace\.
### 6\.1 Model Identification
Figure 5: Per\-model one\-vs\-rest AUC for the Op\-XGB model identification classifier, colored by family\. AUCs span 0\.967 \(R1\-Distill\-Llama\-70B\) to 0\.997 \(Kimi\-K2, Kimi\-K2\.5, and Grok\-3\-Mini, tied\); within\-family pairs \(R1\-Distill, Claude, Kimi\) sit at the lower end where intra\-family confusion is concentrated\.
If operators capture stable behavioral structure, traces from the same model family should cluster together in operator space\. We train Op\-XGB, an XGBoost classifier \(Chen and Guestrin, [2016](https://arxiv.org/html/2605.29192#bib.bib72)\) \(400 trees, max depth 6\), on a curated operator feature set: a 117\-dim handcrafted operator feature vector \(global and quartile\-localized operator frequencies, bigram transition matrix, run\-length statistics, first/last operator one\-hots, and entropy/length scalars\) concatenated with an 8,000\-feature anchor\-phrase TF\-IDF representation; the TF\-IDF is re\-fit inside each training fold \(full feature definitions in Appendix [E](https://arxiv.org/html/2605.29192#A5)\)\. Op\-XGB is trained under problem\-level 5\-fold CV with model identity as the target\. It achieves overall accuracy = 79\.9% \(chance = 8\.3%\) and macro\-AUC = 0\.987\. Per\-model one\-vs\-rest AUCs \(Figure [5](https://arxiv.org/html/2605.29192#S6.F5)\) span 0\.967 for R1\-Distill\-Llama\-70B to 0\.997 for Kimi\-K2, Kimi\-K2\.5, and Grok\-3\-Mini, with the lowest AUCs concentrated in within\-family pairs where intra\-family confusion is hardest\.
The full 12\-class confusion matrix \(Figure [12](https://arxiv.org/html/2605.29192#A12.F12), Appendix [L\.2](https://arxiv.org/html/2605.29192#A12.SS2)\) makes this explicit: R1\-Distill\-Llama\-70B traces are predicted as R1\-Distill\-Qwen\-32B at 0\.41 and the reverse at 0\.29 \(the largest off\-diagonal mass\), consistent with their shared distillation lineage from DeepSeek\-R1; the Claude\-Haiku\-4\.5 / Claude\-Sonnet\-4\.5 pair shows mutual confusion in the 0\.21–0\.26 range; and Kimi\-K2\.5 is predicted as Kimi\-K2 at 0\.10 \(with the reverse at 0\.05\)\.
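To give a feel for the structural features, the Python sketch below builds a small subset of them \(global frequencies, quartile\-localized frequencies, entropy, and first/last one\-hots\) from a single operator sequence\. It is illustrative only: it yields 50 dimensions rather than the full 117, the ordering of `OPERATORS` and the quartile boundaries are assumptions, and the exact definitions are those in Appendix E, not the ones shown here\.

```python
import math

# Assumed fixed operator ordering for this example.
OPERATORS = ["Initiating", "Grounding", "Inferring", "Constraining",
             "Qualifying", "Backtracking", "Hypothesizing"]

def operator_features(seq):
    """Partial structural feature vector for one operator sequence (50 dims)."""
    if not seq:
        raise ValueError("empty operator sequence")
    n = len(seq)
    # Global operator frequencies (7 dims).
    global_freq = [seq.count(op) / n for op in OPERATORS]
    feats = list(global_freq)
    # Quartile-localized frequencies (4 x 7 = 28 dims).
    for q in range(4):
        chunk = seq[q * n // 4:(q + 1) * n // 4]
        feats += [chunk.count(op) / len(chunk) if chunk else 0.0 for op in OPERATORS]
    # Shannon entropy of the global operator distribution, in bits (1 dim).
    feats.append(-sum(p * math.log2(p) for p in global_freq if p > 0))
    # One-hot encodings of the first and last operator (7 + 7 dims).
    feats += [1.0 if seq[0] == op else 0.0 for op in OPERATORS]
    feats += [1.0 if seq[-1] == op else 0.0 for op in OPERATORS]
    return feats
```

A vector like this would then be concatenated with the transition, run\-length, and TF\-IDF features before being passed to the classifier\.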
### 6\.2 Correctness Prediction
We next evaluate whether operator composition can be used to predict answer correctness\. We focus on the 6 benchmarks where operator structure is most informative: AIME, ARC\-Challenge, GPQA, LiveCodeBench, MATH\-500, and MMLU\-Pro\. We exclude BIG\-Bench Hard and HumanEval: BBH uses answer\-matching formats where operator structure has near\-chance predictive signal, and HumanEval correctness is determined by code execution rather than reasoning quality\. Appendix [I](https://arxiv.org/html/2605.29192#A9) shows that this exclusion is empirically motivated\. Our primary metric, WP\-AUC \(within\-problem AUC; adapted from Zhou et al\. \([2018](https://arxiv.org/html/2605.29192#bib.bib2)\)\), computes AUC separately for each problem’s trace group and averages, isolating the ability to discriminate better from worse attempts on the *same* problem, with difficulty factored out\. We use 5\-fold problem\-level CV, assigning all traces for a given problem to the same fold to prevent leakage from multi\-model evaluation\.
We compare six methods: \(1\) trace length \(log response character count\), \(2\) backtrack count \(fraction of spans labeled Backtracking\), \(3\) wait count \(raw count of “wait”\-prefixed spans\), \(4\) SelfCheck Miao et al\. \([2024](https://arxiv.org/html/2605.29192#bib.bib37)\): the same model that generated the trace reads the problem and its full reasoning and predicts correctness without the gold answer, with no training, \(5\) Op\-XGB, and \(6\) OST \(Operator Sequence Transformer; described in §[6\.3](https://arxiv.org/html/2605.29192#S6.SS3)\)\. Methods \(1\)–\(3\) use logistic regression on a single scalar feature with the same 5\-fold problem\-level CV protocol; since a 1\-D logistic regression is a monotone transformation, AUC is fold\-invariant \(CD = ID\)\. Method \(4\) requires no training\. Methods \(5\)–\(6\) are evaluated under two protocols: CD \(cross\-dataset\), trained on all datasets pooled, and ID \(in\-dataset\), trained and evaluated within each dataset separately\.
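The metric can be sketched in plain Python as follows\. The record layout and function names are assumptions for illustration, as is the handling of problems whose traces are all correct or all incorrect: AUC is undefined for them, so this sketch skips them\.

```python
from collections import defaultdict

def auc(scores, labels):
    """Rank-based AUC: P(positive scored above negative), ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None  # undefined without both classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def wp_auc(records):
    """Within-problem AUC: mean of per-problem AUCs.

    records: iterable of (problem_id, predicted_score, is_correct) with
    is_correct in {0, 1}. Single-class problems are skipped (assumption).
    """
    groups = defaultdict(list)
    for problem_id, score, label in records:
        groups[problem_id].append((score, label))
    per_problem = []
    for items in groups.values():
        value = auc([s for s, _ in items], [y for _, y in items])
        if value is not None:
            per_problem.append(value)
    return sum(per_problem) / len(per_problem) if per_problem else None
```

Because scores are only compared within a problem, a predictor that merely ranks easy problems above hard ones earns no credit under this metric\.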
Table[3](https://arxiv.org/html/2605.29192#S6.T3)reports WP\-AUC per dataset for all methods\.
Table 3: Within\-problem AUC \(WP\-AUC\) for correctness prediction\. CD = 5\-fold problem\-level CV trained on all datasets pooled; ID = 5\-fold CV trained and evaluated within each dataset\. Length, Backtrack, and Wait use logistic regression on a single monotonic feature; AUC is rank\-invariant to the training fold \(CD = ID\)\. Op\-XGB: 117\-dim operator features \(frequencies, quartile localization, bigram transitions, run lengths\) \+ 8K anchor\-phrase TF\-IDF, concatenated; XGBoost \(400 trees, depth 6\)\. See Appendix [E](https://arxiv.org/html/2605.29192#A5) for full feature details\. OST at 100% trace depth; see §[6\.3](https://arxiv.org/html/2605.29192#S6.SS3) for partial\-trace results\. Bold = best\. Underline = 2nd best\.
Our methods\. Trace length is near chance globally \(WP\-AUC = 0\.551 within\-dataset\)\. Backtrack count \(0\.594\) and wait count \(0\.603\) outperform length, confirming that reflective operator frequency carries signal even without full sequence modeling\. SelfCheck Miao et al\. \([2024](https://arxiv.org/html/2605.29192#bib.bib37)\), in which the same model that generated the trace reads its own reasoning and predicts correctness, reaches WP\-AUC = 0\.512 globally \(near chance\), a striking null result: reasoning models are overconfident and predict “correct” for both correct and incorrect traces at similar rates\. Op\-XGB reaches 0\.701 cross\-dataset and 0\.723 within\-dataset globally, with strong within\-dataset performance on AIME \(0\.838 ID\) and LiveCodeBench \(0\.795 ID\)\. The OST reaches 0\.701 cross\-dataset and 0\.703 within\-dataset at full trace depth, tied with Op\-XGB cross\-dataset despite reading only operator labels, with no access to anchor\-phrase text\. OST surpasses Op\-XGB on ARC\-Challenge \(0\.645 vs\. 0\.618 CD\) and MATH \(0\.662 vs\. 0\.570 ID\), where temporal operator dynamics capture structure that aggregate feature counts miss\.
### 6\.3 Early Correctness Prediction
Figure 6: Correctness prediction WP\-AUC at increasing trace depths for the three best\-performing datasets \(AIME, LiveCodeBench, GPQA\)\. Navy solid \(OST\): operator sequence transformer trained once on full sequences, reading the discrete operator label sequence at any prefix length without retraining\. Amber dashed \(Op\-XGB\-Early\): full Op\-XGB recipe \(117\-dim operator features \+ 8,000\-dim anchor TF\-IDF; Appendix [E](https://arxiv.org/html/2605.29192#A5)\), retrained from scratch at each depth as an upper\-bound comparison\. Red dashed \(SelfCheck\): the same model that generated the trace reads its own raw reasoning up to depth $p\%$ and predicts correctness with no training\. Shaded bands: 95% CIs\.

A practical question is whether correctness can be estimated from a partial trace, before the model finishes reasoning\. We introduce the Operator Sequence Transformer \(OST\): a lightweight Transformer encoder \(∼800K parameters; 4 layers, $d=128$, 4 heads, pre\-LayerNorm\) trained end\-to\-end directly on operator sequences with no pretraining\. Each operator token is represented as the sum of a unigram embedding, a *bigram* \(transition\) embedding encoding the previous→current operator pair, and a continuous sinusoidal position encoding proportional to the token’s normalized position within the visible prefix, giving the model a fine\-grained sense of *where* in the trace each operator appears\. The model is trained with a pure contrastive loss that directly encourages correct traces to score higher than incorrect traces on the same problem, under the same 5\-fold problem\-level CV protocol as the other methods\. Crucially, the OST is *trained once on full operator sequences*; at inference time it accepts any operator\-sequence prefix without retraining, making it a practical single\-model early\-prediction system deployable at any trace depth\. More details are in Appendix [D](https://arxiv.org/html/2605.29192#A4)\.
Figure [6](https://arxiv.org/html/2605.29192#S6.F6) shows OST WP\-AUC vs\. trace depth for the three best\-performing datasets\. On AIME, WP\-AUC rises from 0\.645 at 10% depth to 0\.740 at 50% and 0\.801 at 100%; at half the trace it already surpasses all span\-free baselines\. LiveCodeBench and GPQA follow similar trajectories, reaching 0\.727 and 0\.691 at full depth\. Globally, the OST rises from WP\-AUC = 0\.605 at 10% depth to 0\.664 at 50% and 0\.701 at 100%, monotonically improving with trace length\.
We compare against Op\-XGB\-Early, which applies the same Op\-XGB feature recipe at each depth $d$: we truncate every trace to its first $d\%$ of spans, recompute the operator\-sequence statistics and the TF\-IDF on the truncated span set, and fit a fresh XGBoost from scratch\. Per\-depth retraining gives Op\-XGB\-Early an explicit upper\-bound advantage, since it is re\-optimized for each truncation level, whereas the OST is trained only once on full sequences\. Globally, Op\-XGB\-Early reaches WP\-AUC 0\.620 at 10%, 0\.647 at 25%, 0\.661 at 50%, 0\.682 at 75%, and 0\.699 at 100%\. The OST trails Op\-XGB\-Early by about 0\.015 at shallow depths and stays within a few thousandths of it from 50% depth onward: OST − Op\-XGB\-Early gaps are −0\.015, −0\.016, \+0\.003, \+0\.006, and \+0\.003 at depths $\{10, 25, 50, 75, 100\}\%$\. The single OST model thus matches a per\-depth\-retrained Op\-XGB, which has access to both the operator features *and* the raw anchor text, for partial\-trace prediction at 50% depth and beyond, despite reading only the discrete operator label sequence and never being retrained for partial inputs\.
By contrast, the partial\-trace SelfCheck baseline, in which the same model reads its own first $p\%$ of raw reasoning text, is near chance at all depths \(0\.50–0\.56 across datasets\), confirming that the OST’s signal comes from the structural operator representation rather than from text content alone\.
## 7 Discussion
ReasonOps is an unsupervised method for annotating reasoning traces, built from careful observations of their semantic structure\. Three\-token pivots strike a useful balance for discourse analysis: they disambiguate the polysemy of single\-token markers like "but" or "so" \(e\.g\., distinguishing "wait," "wait no," and "wait let me" as backtracking, rejection, and correction respectively\) while remaining short enough to recur across tens of thousands of traces\. The top\-2,000 vocabulary filter is essential to this approach, as it excludes high\-frequency but domain\-specific phrases \(e\.g\., "the double bond" in chemistry traces\) that would otherwise contaminate the pivot set with non\-discourse\-functional content\. ReasonOps is a robust method to provide domain\-agnostic, model\-agnostic annotations of reasoning traces that enable universal analysis and downstream tasks such as model identification and correctness prediction\.
Future work and limitations\. We see several avenues for future work on annotating reasoning traces\. An extension could be made to agentic traces, i\.e\., traces of actions taken by an LLM when enabled with a harness\. The ReasonOps approach, which is unsupervised and does not assume a priori sets of operators, might be more useful for agentic traces where actions may change drastically based on the harness\. The operators define a structured vocabulary for step\-level correctness prediction without human annotation\. OST’s early correctness signal can trigger intervention \(e\.g\., beam expansion, early stopping\) at any trace position, enabling compute\-adaptive generation\. There are also several limitations of ReasonOps\. First, the pipeline relies on sentence\-initial pivots; spans bounded at a sub\-sentence level cannot be labeled\. Second, the operators are validated by LLM judgment, which introduces its own biases\. Third, cross\-model transfer is limited: operator frequency features partially capture model\-specific patterns that do not fully generalize\. Finally, the corpus covers 8 benchmarks in English; generalization to other languages is not verified\.
Broader impacts\. This work paves the way to a common language for describing reasoning traces of LLMs\. This could be used for further analysis of LLMs as well as downstream tasks beyond correctness prediction\.
#### Summary\.
Seven reasoning operators emerge from unsupervised analysis of 44,662 chain\-of\-thought traces across 12 thinking LLMs and 8 benchmarks\. Three independent LLM judges confirm the taxonomy \(70–76% classification accuracy; chance: 14%\)\. Structural operator features plus anchor\-phrase features identify the source model with macro\-AUC = 0\.987\. Structural features predict within\-problem correctness well above all span\-free baselines and above the LLM self\-judge \(0\.512, near chance\)\. The OST reaches WP\-AUC = 0\.701 cross\-dataset, tied with Op\-XGB \(0\.701 CD\) while reading only operator labels, not the additional text features Op\-XGB receives\. The OST predicts correctness monotonically from partial traces, reaching 0\.664 at 50% trace depth and surpassing an Op\-XGB baseline retrained at each depth, with no per\-depth retraining required\. The convergence of architecturally diverse models on the same 7 discourse\-level moves suggests that extended reasoning organizes itself into a common compositional vocabulary\.
## Acknowledgments
This work was supported in part by computational resources and research infrastructure from the Stanford AI Lab\. We thank colleagues at Stanford, including Grant Wilkins, Elana Simon, and members of the Zou lab for helpful discussions and feedback\.
## References
- S\. Agarwal, L\. Ahmad, J\. Ai, S\. Altman, A\. Applebaum, E\. Arbus, R\. K\. Arora, Y\. Bai, B\. Baker, H\. Bao, et al\. \(2025\) Gpt\-oss\-120b & gpt\-oss\-20b model card\. arXiv preprint arXiv:2508\.10925\. Cited by: [§1](https://arxiv.org/html/2605.29192#S1.p1.1)\.
- Anthropic \(2025\) Introducing Claude 4\. Note: [https://www\.anthropic\.com/news/claude\-4](https://www.anthropic.com/news/claude-4) Blog post, May 22, 2025\. Cited by: [Table 4](https://arxiv.org/html/2605.29192#A1.T4.4.8.7.4), [Table 4](https://arxiv.org/html/2605.29192#A1.T4.4.9.8.4)\.
- A\. Bercovich, I\. Levy, I\. Golan, M\. Dabbah, R\. El\-Yaniv, O\. Puny, et al\. \(2025\) Llama\-nemotron: efficient reasoning models\. arXiv preprint arXiv:2505\.00949\. Cited by: [Table 4](https://arxiv.org/html/2605.29192#A1.T4.4.11.10.4)\.
- P\. C\. Bogdan, U\. Macar, N\. Nanda, and A\. Conmy \(2025\) Thought anchors: which llm reasoning steps matter?\. arXiv preprint arXiv:2506\.19143\. Cited by: [§1](https://arxiv.org/html/2605.29192#S1.p2.1), [§2](https://arxiv.org/html/2605.29192#S2.p1.1)\.
- S\. R\. Bowman, J\. Hyun, E\. Perez, E\. Chen, C\. Pettit, S\. Heiner, K\. Lukošiūtė, A\. Askell, A\. Jones, A\. Chen, et al\. \(2022\) Measuring progress on scalable oversight for large language models\. arXiv preprint arXiv:2211\.03540\. Cited by: [§1](https://arxiv.org/html/2605.29192#S1.p1.1), [§2](https://arxiv.org/html/2605.29192#S2.p2.1)\.
- B\. Brown, J\. Juravsky, R\. Ehrlich, R\. Clark, Q\. V\. Le, C\. Ré, and A\. Mirhoseini \(2024\) Large language monkeys: scaling inference compute with repeated sampling\. arXiv preprint arXiv:2407\.21787\. Cited by: [§2](https://arxiv.org/html/2605.29192#S2.p3.1)\.
- M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. de Oliveira Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman, A\. Ray, R\. Puri, G\. Krueger, M\. Petrov, H\. Khlaaf, G\. Sastry, P\. Mishkin, B\. Chan, S\. Gray, N\. Ryder, M\. Pavlov, A\. Power, L\. Kaiser, M\. Bavarian, C\. Winter, P\. Tillet, F\. P\. Such, D\. Cummings, M\. Plappert, F\. Chantzis, E\. Barnes, A\. Herbert\-Voss, W\. H\. Guss, A\. Nichol, A\. Paino, N\. Tezak, J\. Tang, I\. Babuschkin, S\. Balaji, S\. Jain, W\. Saunders, C\. Hesse, A\. N\. Carr, J\. Leike, J\. Achiam, V\. Misra, E\. Morikawa, A\. Radford, M\. Knight, M\. Brundage, M\. Murati, K\. Mayer, P\. Welinder, B\. McGrew, D\. Amodei, S\. McCandlish, I\. Sutskever, and W\. Zaremba \(2021\) Evaluating large language models trained on code\. External Links: 2107\.03374\. Cited by: [§3](https://arxiv.org/html/2605.29192#S3.p2.1)\.
- T\. Chen and C\. Guestrin \(2016\) Xgboost: a scalable tree boosting system\. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pp\. 785–794\. Cited by: [§6\.1](https://arxiv.org/html/2605.29192#S6.SS1.p1.11)\.
- P\. Clark, I\. Cowhey, O\. Etzioni, T\. Khot, A\. Sabharwal, C\. Schoenick, and O\. Tafjord \(2018\) Think you have solved question answering? Try ARC, the AI2 reasoning challenge\. arXiv preprint arXiv:1803\.05457\. Cited by: [§3](https://arxiv.org/html/2605.29192#S3.p2.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, et al\. \(2021\) Training verifiers to solve math word problems\. arXiv preprint arXiv:2110\.14168\. Cited by: [§2](https://arxiv.org/html/2605.29192#S2.p3.1)\.
- DeepSeek\-AI, D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, et al\. \(2025\) DeepSeek\-R1: incentivizing reasoning capability in LLMs via reinforcement learning\. arXiv preprint arXiv:2501\.12948\. Cited by: [Table 4](https://arxiv.org/html/2605.29192#A1.T4.4.5.4.4), [Table 4](https://arxiv.org/html/2605.29192#A1.T4.4.6.5.4), [Table 4](https://arxiv.org/html/2605.29192#A1.T4.4.7.6.4)\.
- J\. Engels, D\. D\. Baek, S\. Kantamneni, and M\. Tegmark \(2026\) Scaling laws for scalable oversight\. In The Thirty\-ninth Annual Conference on Neural Information Processing Systems, External Links: [Link](https://openreview.net/forum?id=u1j6RqH8nM)\. Cited by: [§2](https://arxiv.org/html/2605.29192#S2.p2.1)\.
- M\. Y\. Guan, M\. Wang, M\. Carroll, Z\. Dou, A\. Y\. Wei, M\. Williams, B\. Arnav, J\. Huizinga, I\. Kivlichan, M\. Glaese, et al\. \(2025\) Monitoring monitorability\. arXiv preprint arXiv:2512\.18311\. Cited by: [§1](https://arxiv.org/html/2605.29192#S1.p1.1), [§2](https://arxiv.org/html/2605.29192#S2.p2.1)\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi, et al\. \(2025\) DeepSeek\-r1 incentivizes reasoning in llms through reinforcement learning\. Nature 645 \(8081\), pp\. 633–638\. Cited by: [§1](https://arxiv.org/html/2605.29192#S1.p1.1), [§2](https://arxiv.org/html/2605.29192#S2.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt \(2021\) Measuring mathematical problem solving with the MATH dataset\. NeurIPS\. Cited by: [§3](https://arxiv.org/html/2605.29192#S3.p2.1)\.
- A\. Hosseini, X\. Yuan, N\. Malkin, A\. Courville, A\. Sordoni, and R\. Agarwal \(2024\) V\-STar: training verifiers for self\-taught reasoners\. In First Conference on Language Modeling, External Links: [Link](https://openreview.net/forum?id=stmqBSW2dV)\. Cited by: [§2](https://arxiv.org/html/2605.29192#S2.p3.1)\.
- N\. Jain, K\. Han, A\. Gu, W\. Li, F\. Yan, T\. Zhang, S\. Wang, A\. Solar\-Lezama, K\. Sen, and I\. Stoica \(2024\) LiveCodeBench: holistic and contamination free evaluation of large language models for code\. arXiv preprint arXiv:2403\.07974\. Cited by: [§3](https://arxiv.org/html/2605.29192#S3.p2.1)\.
- L\. Kang, Y\. Deng, Y\. Xiao, Z\. Mo, W\. S\. Lee, and L\. Bing \(2025\) First try matters: revisiting the role of reflection in reasoning models\. arXiv preprint arXiv:2510\.08308\. Cited by: [§5\.2](https://arxiv.org/html/2605.29192#S5.SS2.p1.2)\.
- P\. Kargupta, S\. S\. Li, H\. Wang, J\. Lee, S\. Chen, O\. Ahia, D\. Light, T\. L\. Griffiths, M\. Kleiman\-Weiner, J\. Han, et al\. \(2025\) Cognitive foundations for reasoning and their manifestation in llms\. arXiv preprint arXiv:2511\.16660\. Cited by: [§2](https://arxiv.org/html/2605.29192#S2.p1.1)\.
- Z\. Kenton, N\. Y\. Siegel, J\. Kramár, J\. Brown\-Cohen, S\. Albanie, J\. Bulian, R\. Agarwal, D\. Lindner, Y\. Tang, N\. D\. Goodman, et al\. \(2024\) On scalable oversight with weak llms judging strong llms\. Advances in Neural Information Processing Systems 37, pp\. 75229–75276\. Cited by: [§2](https://arxiv.org/html/2605.29192#S2.p2.1)\.
- J\. Kim, S\. Lai, N\. Scherrer, J\. Evans, et al\. \(2026\) Reasoning models generate societies of thought\. arXiv preprint arXiv:2601\.10825\. Cited by: [§1](https://arxiv.org/html/2605.29192#S1.p2.1), [§2](https://arxiv.org/html/2605.29192#S2.p1.1)\.
- Kimi Team \(2025\) Kimi k2: open agentic intelligence\. arXiv preprint arXiv:2507\.20534\. Cited by: [Table 4](https://arxiv.org/html/2605.29192#A1.T4.4.12.11.4)\.
- T\. Korbak, M\. Balesni, E\. Barnes, Y\. Bengio, J\. Benton, J\. Bloom, M\. Chen, A\. Cooney, A\. Dafoe, A\. Dragan, et al\. \(2025\) Chain of thought monitorability: a new and fragile opportunity for ai safety\. arXiv preprint arXiv:2507\.11473\. Cited by: [§1](https://arxiv.org/html/2605.29192#S1.p1.1), [§2](https://arxiv.org/html/2605.29192#S2.p2.1)\.
- T\. Lanham, A\. Chen, A\. Radhakrishnan, B\. Steiner, C\. Denison, D\. Hernandez, D\. Li, E\. Durmus, E\. Hubinger, J\. Kernion, et al\. \(2023\) Measuring faithfulness in chain\-of\-thought reasoning\. arXiv preprint arXiv:2307\.13702\. Cited by: [§2](https://arxiv.org/html/2605.29192#S2.p2.1)\.
- J\. Lee, S\. Mukherjee, D\. Hakkani\-Tur, and J\. Hockenmaier \(2025\) Reasoningflow: semantic structure of complex reasoning traces\. arXiv preprint arXiv:2506\.02532\. Cited by: [§1](https://arxiv.org/html/2605.29192#S1.p2.1), [§2](https://arxiv.org/html/2605.29192#S2.p1.1)\.
- M\. Li, N\. Zhang, C\. Fan, H\. Jiao, Y\. Fu, S\. Peters, Q\. Xu, R\. Lissitz, and T\. Zhou \(2025\) Understanding the thinking process of reasoning models: a perspective from schoenfeld’s episode theory\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp\. 18278–18299\. Cited by: [§1](https://arxiv.org/html/2605.29192#S1.p2.1), [§2](https://arxiv.org/html/2605.29192#S2.p1.1)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2023\) Let’s verify step by step\. In The twelfth international conference on learning representations\. Cited by: [§1](https://arxiv.org/html/2605.29192#S1.p1.1), [§2](https://arxiv.org/html/2605.29192#S2.p3.1)\.
- N\. Miao, Y\. W\. Teh, and T\. Rainforth \(2024\) SelfCheck: using llms to zero\-shot check their own step\-by\-step reasoning\. In The Twelfth International Conference on Learning Representations \(ICLR\), External Links: [Link](https://openreview.net/forum?id=pTHfApDakA)\. Cited by: [§2](https://arxiv.org/html/2605.29192#S2.p3.1), [§6\.2](https://arxiv.org/html/2605.29192#S6.SS2.p2.1), [§6\.2](https://arxiv.org/html/2605.29192#S6.SS2.p4.14)\.
- D\. Paul, R\. West, A\. Bosselut, and B\. Faltings \(2024\) Making reasoning matter: measuring and improving faithfulness of chain\-of\-thought reasoning\. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp\. 15012–15032\. Cited by: [§2](https://arxiv.org/html/2605.29192#S2.p2.1)\.
- Qwen Team \(2025\) QwQ\-32B: embracing the power of reinforcement learning\. Note: [https://qwenlm\.github\.io/blog/qwq\-32b/](https://qwenlm.github.io/blog/qwq-32b/) Blog post, March 6, 2025\. Cited by: [Table 4](https://arxiv.org/html/2605.29192#A1.T4.4.2.1.4)\.
- D\. Rein, B\. L\. Hou, A\. C\. Stickland, J\. Petty, R\. Y\. Pang, J\. Dirani, J\. Michael, and S\. R\. Bowman \(2024\) GPQA: a graduate\-level google\-proof Q&A benchmark\. arXiv preprint arXiv:2311\.12022\. Cited by: [§3](https://arxiv.org/html/2605.29192#S3.p2.1)\.
- P\. Shojaee, I\. Mirzadeh, K\. Alizadeh, M\. Horton, S\. Bengio, and M\. Farajtabar \(2025\) The illusion of thinking: understanding the strengths and limitations of reasoning models via the lens of problem complexity\. arXiv preprint arXiv:2506\.06941\. Cited by: [§1](https://arxiv.org/html/2605.29192#S1.p2.1), [§2](https://arxiv.org/html/2605.29192#S2.p1.1)\.
- C\. Snell, J\. Lee, K\. Xu, and A\. Kumar \(2024\) Scaling llm test\-time compute optimally can be more effective than scaling model parameters\. arXiv preprint arXiv:2408\.03314\. Cited by: [§2](https://arxiv.org/html/2605.29192#S2.p3.1)\.
- M\. Suzgun, N\. Scales, N\. Schärli, S\. Gehrmann, Y\. Tay, H\. W\. Chung, A\. Chowdhery, Q\. Le, E\. Chi, D\. Zhou, et al\. \(2023\) Challenging big\-bench tasks and whether chain\-of\-thought can solve them\. In Findings of the Association for Computational Linguistics: ACL 2023, pp\. 13003–13051\. Cited by: [§3](https://arxiv.org/html/2605.29192#S3.p2.1)\.
- K\. Team, T\. Bai, Y\. Bai, Y\. Bao, S\. Cai, Y\. Cao, Y\. Charles, H\. Che, C\. Chen, G\. Chen, et al\. \(2026\) Kimi k2\.5: visual agentic intelligence\. arXiv preprint arXiv:2602\.02276\. Cited by: [§1](https://arxiv.org/html/2605.29192#S1.p1.1)\.
- C\. Venhoff, I\. Arcuschin, P\. Torr, A\. Conmy, and N\. Nanda \(2025\) Understanding reasoning in thinking language models via steering vectors\. In Workshop on Reasoning and Planning for Large Language Models, External Links: [Link](https://openreview.net/forum?id=OwhVWNOBcz)\. Cited by: [§1](https://arxiv.org/html/2605.29192#S1.p2.1), [§2](https://arxiv.org/html/2605.29192#S2.p1.1)\.
- H\. Wang, J\. Song, J\. Li, Q\. Zhu, F\. Mi, G\. Cui, Y\. Wang, and L\. Shang \(2026\) Teaching large reasoning models effective reflection\. arXiv preprint arXiv:2601\.12720\. Cited by: [§5\.2](https://arxiv.org/html/2605.29192#S5.SS2.p1.2)\.
- L\. Wang, N\. Yang, X\. Huang, B\. Jiao, L\. Yang, D\. Jiang, R\. Majumder, and F\. Wei \(2022\) Text embeddings by weakly\-supervised contrastive pre\-training\. arXiv preprint arXiv:2212\.03533\. Cited by: [§1](https://arxiv.org/html/2605.29192#S1.p3.2), [§3](https://arxiv.org/html/2605.29192#S3.p5.8)\.
- P\. Wang, L\. Li, Z\. Shao, R\. Xu, D\. Dai, Y\. Li, D\. Chen, Y\. Wu, and Z\. Sui \(2024a\) Math\-shepherd: verify and reinforce llms step\-by\-step without human annotations\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 9426–9439\. Cited by: [§2](https://arxiv.org/html/2605.29192#S2.p3.1)\.
- X\. Wang, J\. Wei, D\. Schuurmans, Q\. V\. Le, E\. H\. Chi, S\. Narang, A\. Chowdhery, and D\. Zhou \(2023\) Self\-consistency improves chain of thought reasoning in language models\. In The Eleventh International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=1PL1NIMMrw)\. Cited by: [§2](https://arxiv.org/html/2605.29192#S2.p3.1)\.
- Y\. Wang, X\. Ma, G\. Zhang, Y\. Ni, A\. Chandra, S\. Guo, W\. Ren, A\. Arulraj, X\. He, Z\. Jiang, et al\. \(2024b\) Mmlu\-pro: a more robust and challenging multi\-task language understanding benchmark\. Advances in Neural Information Processing Systems 37, pp\. 95266–95290\. Cited by: [§3](https://arxiv.org/html/2605.29192#S3.p2.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou, et al\. \(2022\) Chain\-of\-thought prompting elicits reasoning in large language models\. Advances in neural information processing systems 35, pp\. 24824–24837\. Cited by: [§1](https://arxiv.org/html/2605.29192#S1.p1.1), [§1](https://arxiv.org/html/2605.29192#S1.p2.1)\.
- xAI \(2025\) Grok 3 beta — the age of reasoning agents\. Note: [https://x\.ai/news/grok\-3](https://x.ai/news/grok-3) Blog post, February 19, 2025\. Cited by: [Table 4](https://arxiv.org/html/2605.29192#A1.T4.4.10.9.4)\.
- Y\. Xiang, Y\. Ji, R\. Xu, D\. Qiao, Z\. Yang, J\. Li, and M\. Zhang \(2026\) When is thinking enough? early exit via sufficiency assessment for efficient reasoning\. arXiv preprint arXiv:2604\.06787\. Cited by: [§2](https://arxiv.org/html/2605.29192#S2.p3.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, et al\. \(2025a\) Qwen3 technical report\. arXiv preprint arXiv:2505\.09388\. Cited by: [§1](https://arxiv.org/html/2605.29192#S1.p1.1)\.
- S\. Yang, Y\. Tong, X\. Niu, G\. Neubig, and X\. Yue \(2025b\) Demystifying long chain\-of\-thought reasoning\. In Forty\-second International Conference on Machine Learning\. Cited by: [§1](https://arxiv.org/html/2605.29192#S1.p2.1), [§2](https://arxiv.org/html/2605.29192#S2.p1.1)\.
- L\. Zhang, A\. Hosseini, H\. Bansal, M\. Kazemi, A\. Kumar, and R\. Agarwal \(2025\) Generative verifiers: reward modeling as next\-token prediction\. In The Thirteenth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=Ccwp4tFEtE)\. Cited by: [§2](https://arxiv.org/html/2605.29192#S2.p3.1)\.
- Z\. Zhao, Y\. Koishekenov, X\. Yang, N\. Murray, and N\. Cancedda \(2026\) Verifying chain\-of\-thought reasoning via its computational graph\. In The Fourteenth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=CxiNICq0Rr)\. Cited by: [§2](https://arxiv.org/html/2605.29192#S2.p3.1)\.
- G\. Zhou, X\. Zhu, C\. Song, Y\. Fan, H\. Zhu, X\. Ma, Y\. Yan, J\. Jin, H\. Li, and K\. Gai \(2018\) Deep interest network for click\-through rate prediction\. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp\. 1059–1068\. Cited by: [§6\.2](https://arxiv.org/html/2605.29192#S6.SS2.p1.1)\.
## Appendix A Model Details
Table [4](https://arxiv.org/html/2605.29192#A1.T4) lists the 12 thinking models used for trace collection\. All expose raw chain\-of\-thought tokens accessible via API\. All models except Claude are queried via the OpenRouter API; reasoning tokens are returned in the `reasoning` field of the OpenRouter response message\. The `reasoning` request parameter is set to `{effort: high}` for Grok\-3\-Mini, `{max_tokens: N}` for Qwen3 thinking variants, and omitted for DeepSeek, QwQ\-32B, Kimi, and Nemotron models\. Claude models are accessed directly via the Anthropic API with `thinking: {type: enabled, budget_tokens: N}`; raw thinking content is extracted from `type: thinking` response blocks\.
Table 4: Models used for trace collection\. All 12 expose raw chain\-of\-thought tokens\.
## Appendix B Corpus Statistics
The full corpus spans 8 benchmarks and contains 44,662 traces across 12 models\. Table[5](https://arxiv.org/html/2605.29192#A2.T5)reports per\-dataset statistics for the 6\-dataset correctness prediction subset \(33,209 traces\), which excludes BBH and HumanEval due to grading ambiguity or near\-ceiling accuracy\. Each problem was attempted by all 12 models with up to 5 independent samples per model; the AIME trace count is lower because AIME 2024 has only 30 problems\.
Table 5: Per\-dataset trace counts in the 6\-dataset correctness prediction corpus\. All 12 models contribute to every dataset\. Accuracy = fraction of traces judged correct by the official grader\.

The full operator\-discovery corpus additionally includes BIG\-Bench Hard and HumanEval \(44,662 traces total across all 8 benchmarks\)\. These two datasets are excluded from correctness prediction: BBH uses answer\-matching formats where operator structure shows near\-chance predictive signal, and HumanEval correctness is determined by code execution rather than reasoning quality\.
## Appendix C $K$ Selection
Table [6](https://arxiv.org/html/2605.29192#A3.T6) reports LLM\-judge $\kappa$ \(Claude Sonnet 4\.6\) for each $K$ in the sweep\. $K=7$ achieves the highest $\kappa$\.
Table 6: $K$ selection via LLM\-judge $\kappa$ \(Claude Sonnet 4\.6\)\.
## Appendix D Operator Sequence Transformer \(OST\) Architecture
The OST is a lightweight Transformer encoder trained end\-to\-end on operator token sequences\. Its design is optimized for long sequences of discrete operator labels \(typical length: 50–500 tokens\) while remaining parameter\-efficient enough to train in 5\-fold CV\.
#### Tokenization\.
Each span in a trace is assigned one of 7 operator labels \(integer indices 0–6\)\. The OST processes the sequence of these labels for a trace prefix of specified depth\.
#### Input embeddings\.
Each position $t$ in the operator sequence receives a 128\-dimensional input vector formed as the sum of three components:
- Unigram embedding: a learned embedding $E_u[o_t]\in\mathbb{R}^{128}$ for the current operator $o_t$;
- Bigram \(transition\) embedding: a learned embedding $E_b[o_{t-1},o_t]\in\mathbb{R}^{128}$ for the previous→current operator pair \(special “start” token for $t=0$\);
- Continuous sinusoidal position encoding: standard sinusoidal encoding scaled by the *normalized* position $t/T$ within the visible prefix, giving the model a fine\-grained sense of where in the trace each operator appears\.
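As a rough illustration of the three\-way sum described above, the following NumPy sketch builds the input vectors for one operator prefix\. The tables `E_u` and `E_b` are random stand\-ins for the learned embeddings, and the `START` index, the `scale` factor, and the function names are assumptions made for illustration; the text does not specify exactly how the normalized position enters the sinusoid\.

```python
import numpy as np

NUM_OPS = 7      # seven operator labels, indices 0-6
D_MODEL = 128    # embedding width stated in the text
START = NUM_OPS  # hypothetical row index for the "start" token at t = 0

rng = np.random.default_rng(0)
# Randomly initialised stand-ins for the learned tables E_u and E_b.
E_u = rng.normal(scale=0.02, size=(NUM_OPS, D_MODEL))
E_b = rng.normal(scale=0.02, size=(NUM_OPS + 1, NUM_OPS, D_MODEL))


def sinusoidal(pos, d_model=D_MODEL, scale=100.0):
    """Sinusoidal encoding of continuous positions in [0, 1).

    `scale` stretches the normalized position before the usual sin/cos
    frequencies are applied; its value is an assumption of this sketch.
    """
    pos = np.asarray(pos, dtype=float)[:, None] * scale
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000.0 ** (2 * i / d_model))
    enc = np.empty((pos.shape[0], d_model))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc


def embed_prefix(ops):
    """Sum of unigram, bigram and position encodings for one prefix."""
    ops = np.asarray(ops, dtype=int)
    T = len(ops)
    prev = np.concatenate(([START], ops[:-1]))  # previous operator per step
    pos = np.arange(T) / T                      # normalized position t / T
    return E_u[ops] + E_b[prev, ops] + sinusoidal(pos)
```

Because the position is divided by the visible prefix length, the same operator at the same absolute index receives a different encoding when the prefix is shorter, which is what allows one model to be queried at any depth\.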
#### Transformer body\.
The model uses a standard pre\-LayerNorm Transformer encoder with 4 layers, $d_{\text{model}}=128$, 4 attention heads, feed\-forward width 512, and 5% token dropout during training\. Attention is full \(not causal\), since the model reads a complete prefix\.
#### Pooling and scoring\.
Sequence representations are aggregated by attention pooling \(learned query vector\) into a single 128\-dimensional vector, followed by a two\-layer MLP \($128\to 64\to 1$\) producing a real\-valued correctness score\. Total parameter count: ≈800,000\.
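The stated total can be sanity\-checked with a short back\-of\-the\-envelope count\. The sketch below assumes biases on every linear layer, two LayerNorms per block, a bigram table with one extra “start” row, and a single pooling query vector; these bookkeeping choices are assumptions, not details confirmed by the text\.

```python
D, FF, LAYERS, OPS = 128, 512, 4, 7


def linear(n_in, n_out):
    """Parameters of a dense layer: weights plus bias."""
    return n_in * n_out + n_out


attention = 4 * linear(D, D)                  # Q, K, V and output projections
feed_forward = linear(D, FF) + linear(FF, D)  # 128 -> 512 -> 128
layer_norms = 2 * 2 * D                       # two LayerNorms, scale + shift
per_layer = attention + feed_forward + layer_norms

embeddings = OPS * D + (OPS + 1) * OPS * D    # unigram + bigram (with start row)
pooling = D                                   # learned attention-pooling query
head = linear(D, 64) + linear(64, 1)          # 128 -> 64 -> 1 scoring MLP

total = LAYERS * per_layer + embeddings + pooling + head
print(f"approximate parameter count: {total:,}")
```

Under these assumptions the count comes to roughly 0\.81M, dominated by the four encoder blocks, which is consistent with the ≈800K figure\.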
#### Training\.
The OST is trained with a pure contrastive loss: for every problem with at least one correct and one incorrect trace in the training fold, all pairs $(i^{+}, i^{-})$ of correct and incorrect traces for that problem are formed, and the loss pushes the correct trace to score higher\. This directly optimizes the within\-problem ranking that WP\-AUC measures\. Training uses AdamW \($\eta=3\times 10^{-4}$, weight decay 0\.01\), batch size 64, and early stopping on fold validation WP\-AUC\.
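A minimal sketch of the pairing scheme and of the ranking metric it targets is given below\. The exact form of the contrastive loss is not specified in this section, so the logistic pairwise loss is an assumed stand\-in, and computing WP\-AUC as the pooled fraction of correctly ordered same\-problem pairs is likewise an assumption; all function names are hypothetical\.

```python
import math
from collections import defaultdict


def within_problem_pairs(problem_ids, scores, labels):
    """Yield (score_correct, score_incorrect) for every same-problem pair."""
    by_problem = defaultdict(lambda: ([], []))
    for pid, s, y in zip(problem_ids, scores, labels):
        by_problem[pid][0 if y else 1].append(s)
    # Problems lacking a correct or an incorrect trace contribute no pairs.
    for correct, incorrect in by_problem.values():
        for sp in correct:
            for sn in incorrect:
                yield sp, sn


def pairwise_loss(problem_ids, scores, labels):
    """Mean logistic ranking loss, log(1 + exp(-(s+ - s-))) (assumed form)."""
    losses = [math.log1p(math.exp(-(sp - sn)))
              for sp, sn in within_problem_pairs(problem_ids, scores, labels)]
    return sum(losses) / len(losses)


def wp_auc(problem_ids, scores, labels):
    """Fraction of same-problem pairs ranked correctly; ties count 0.5."""
    wins = [1.0 if sp > sn else 0.5 if sp == sn else 0.0
            for sp, sn in within_problem_pairs(problem_ids, scores, labels)]
    return sum(wins) / len(wins)
```

Both functions iterate over the same set of pairs, which makes concrete why minimizing the pairwise loss acts as a smooth surrogate for the within\-problem ranking metric\.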
#### Early prediction\.
At inference time the OST is given only the first $p\%$ of operator tokens \(by span count\), allowing evaluation at any trace depth\. The sinusoidal position encoding is re\-normalized to the partial prefix length\.
## Appendix E Op\-XGB Feature Details
Op\-XGB combines a 117\-dimensional handcrafted operator\-sequence feature vector with an 8,000\-feature anchor\-phrase TF\-IDF representation\. Both feature sets are derived from the operator labels and sentence\-initial pivot phrases—no substantive math, code, or reasoning content of the trace is used\.
#### Operator feature vector \(117 dim\)\.
For each trace’s operator sequence $\sigma=(\sigma_1,\dots,\sigma_n)\in\{0,\dots,6\}^{n}$:
- Global frequencies \(7\): the share of each operator across the full sequence, $\mathrm{freq}_k=\tfrac{1}{n}\sum_{t}\mathbf{1}[\sigma_t=k]$\.
- Quartile frequencies \($7\times 4=28$\): the share of each operator within each of four equal\-sized contiguous chunks of $\sigma$, capturing temporal localization\.
- Scalar summaries \(5\): Shannon entropy of the global frequency vector, the maximum of $\mathrm{freq}$, the number of distinct operators that appear, $\log(1+n)$, and the immediate\-repeat rate $\tfrac{1}{n-1}\sum_{t\geq 2}\mathbf{1}[\sigma_t=\sigma_{t-1}]$\.
- First/last one\-hot \($7+7=14$\): one\-hot encodings of $\sigma_1$ and $\sigma_n$\.
- Bigram transition matrix \(49\): the $7\times 7$ matrix of bigram frequencies $\tfrac{1}{n-1}\sum_{t\geq 2}\mathbf{1}[\sigma_{t-1}=a,\ \sigma_t=b]$, flattened to a 49\-dimensional vector of normalized transition counts\.
- Run\-length statistics \($7+7=14$\): for each operator $k$, the mean and max length of consecutive runs of $k$, each normalized by $n$\.
This vector is computed once per trace and depends only on the discrete operator label sequence; it carries no anchor text\.
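The feature list above can be sketched in NumPy as follows\. This is an illustrative reconstruction rather than the authors’ code: the function name, the use of natural\-log entropy, the handling of chunks when $n$ is not divisible by 4, and the ordering of the concatenated blocks are all assumptions\.

```python
import numpy as np

K = 7  # number of operators


def operator_features(seq, k=K):
    """Return the 117-dim vector: 7 + 28 + 5 + 14 + 49 + 14 features."""
    seq = np.asarray(seq, dtype=int)
    n = len(seq)
    onehot = np.eye(k)[seq]                        # (n, k)

    freq = onehot.mean(axis=0)                     # 7 global frequencies
    # 28 quartile frequencies over four near-equal contiguous chunks.
    quart = np.concatenate([
        c.mean(axis=0) if len(c) else np.zeros(k)
        for c in np.array_split(onehot, 4)
    ])

    nz = freq[freq > 0]
    repeats = seq[1:] == seq[:-1]
    scalars = np.array([
        -(nz * np.log(nz)).sum(),                  # Shannon entropy (nats)
        freq.max(),
        float(len(nz)),                            # distinct operators
        np.log1p(n),
        repeats.mean() if n > 1 else 0.0,          # immediate-repeat rate
    ])

    first_last = np.concatenate([np.eye(k)[seq[0]], np.eye(k)[seq[-1]]])

    bigram = np.zeros((k, k))
    np.add.at(bigram, (seq[:-1], seq[1:]), 1.0)    # count transitions a -> b
    bigram = bigram.ravel() / max(n - 1, 1)

    # Run lengths: split the sequence wherever the label changes.
    runs = np.split(seq, np.flatnonzero(~repeats) + 1)
    mean_run, max_run = np.zeros(k), np.zeros(k)
    for op in range(k):
        lengths = [len(r) for r in runs if r[0] == op]
        if lengths:
            mean_run[op] = np.mean(lengths) / n
            max_run[op] = np.max(lengths) / n

    return np.concatenate(
        [freq, quart, scalars, first_last, bigram, mean_run, max_run])
```

For example, `operator_features([0, 0, 1, 2, 2, 2, 6, 1])` returns a vector of length 117 whose first seven entries sum to one\.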
#### Anchor\-phrase TF\-IDF \(8,000 dim\)\.
We additionally extract each span’s anchor \(the sentence\-initial pivot phrase, truncated to 80 characters\), concatenate all anchors of a trace with single spaces, and fit `sklearn.feature_extraction.text.TfidfVectorizer` with `max_features=8000` and `sublinear_tf=True` on the training\-fold anchor strings \(default unigram tokenization, sublinear TF $1+\log(\mathrm{tf})$, IDF reweighting, vocabulary re\-fit inside each fold to prevent leakage\)\. The anchor strings consist only of discourse\-level pivot phrases \(e\.g\., “let’s denote”, “so we have”, “wait, that means”\); they are exactly the inputs that the unsupervised operator\-discovery pipeline embeds and clusters into the 7 operators\.
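The anchor\-phrase step can be sketched with scikit\-learn as follows\. Only the two `TfidfVectorizer` arguments named in the text are set; the helper names and the list\-of\-anchor\-lists input format are assumptions made for illustration\.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def anchor_document(anchors, max_chars=80):
    """Join one trace's anchors, each truncated to 80 characters."""
    return " ".join(a[:max_chars] for a in anchors)


def fit_anchor_tfidf(train_traces, test_traces):
    """Fit TF-IDF on training-fold anchors only, then transform both folds."""
    vec = TfidfVectorizer(max_features=8000, sublinear_tf=True)
    x_train = vec.fit_transform([anchor_document(t) for t in train_traces])
    x_test = vec.transform([anchor_document(t) for t in test_traces])
    return vec, x_train, x_test
```

Calling `fit_transform` on the training fold and `transform` on the held\-out fold is what keeps the vocabulary and IDF weights from leaking across folds\.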
#### Classifier\.
The 117\-dim operator feature vector and the 8,000\-dim TF\-IDF vector are concatenated to form an 8,117\-dim per\-trace representation, which is passed to XGBoost \(400 trees, max depth 6, learning rate 0\.05, subsample 0\.8, colsample\_bytree 0\.8, logloss objective\)\. The classifier is trained with problem\-level 5\-fold CV \(cross\-dataset\) or within each dataset separately \(in\-dataset\)\. XGBoost predicted probabilities are used directly as scores for WP\-AUC computation\. Op\-XGB therefore reads operator information at*both*levels of abstraction \(discrete labels and the raw pivot phrases that produce them\), in contrast to the OST, which reads only the discrete labels\.
#### Op\-XGB\-Early variant\.
For the early\-prediction comparison \(§[6\.3](https://arxiv.org/html/2605.29192#S6.SS3), Figure [6](https://arxiv.org/html/2605.29192#S6.F6)\), at each depth $d\in\{10,25,50,75,100\}\%$ we truncate each trace’s spans to the first $\lceil N\cdot d/100\rceil$ spans, recompute the 117\-dim operator feature vector on the truncated operator subsequence, recompute the anchor TF\-IDF on the truncated anchor text, and re\-fit a fresh XGBoost on the concatenated representation under the same 5\-fold problem\-level CV\. This per\-depth retraining provides an upper bound on what the full Op\-XGB feature recipe can achieve when the model is re\-optimized for each truncation level, in contrast to the OST, which is trained once on full sequences and evaluated on partial prefixes without retraining\.
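The truncation rule is simple enough to state in a few lines\. The sketch below uses integer arithmetic for the ceiling and assumes integer depth percentages; the names `DEPTHS` and `truncate_spans` are hypothetical\.

```python
DEPTHS = (10, 25, 50, 75, 100)  # evaluation depths in percent


def truncate_spans(spans, depth_pct):
    """Keep the first ceil(N * d / 100) spans of a trace."""
    keep = (len(spans) * depth_pct + 99) // 100  # integer ceiling division
    return spans[:keep]


# Example: prefix lengths for a 7-span trace at each depth.
prefix_lengths = [len(truncate_spans(list(range(7)), d)) for d in DEPTHS]
```

Because of the ceiling, every non\-empty trace keeps at least one span even at 10% depth, so features are always defined on the truncated prefix\.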
#### Op\-XGB for model identification\.
For 12\-class source\-model identification \(Figure [12](https://arxiv.org/html/2605.29192#A12.F12)\), we use the identical Op\-XGB feature recipe described above \(117\-dim operator features \+ 8,000\-dim TF\-IDF, concatenated to an 8,117\-dim vector\) and train XGBoost as a multi\-class classifier \(multi:softprob, same hyperparameters as the binary version\) under the same problem\-level 5\-fold CV protocol\. Reusing the full Op\-XGB recipe ensures that the same model class supports both correctness prediction and model identification\.
## Appendix F Operator Usage by Correctness per Benchmark
Figure [7](https://arxiv.org/html/2605.29192#A6.F7) shows operator frequency separately for correct and incorrect traces\. Inferring is consistently higher in correct traces; Hypothesizing and Backtracking are higher in incorrect traces\. The gap is largest on commonsense benchmarks \(ARC\-C, MMLU\-Pro: $\sim$24% incorrect vs\. $\sim$14% correct for Hypothesizing\)\. On LiveCodeBench, Constraining is 10 pp higher in correct traces\.
Figure 7: Operator frequency \(%\) per benchmark, split by correctness\. Left: correct traces\. Right: incorrect traces\.
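A frequency table of this kind can be computed by counting operator labels separately for correct and incorrect traces. The sketch below pools span counts across traces within each group. Whether the paper pools spans or averages per-trace frequencies is not stated, so the pooling is an assumption, and the toy data are invented.

```python
from collections import Counter

def operator_frequency(traces):
    """Return {is_correct: {operator: percent of spans}}.

    traces is a list of (operator_labels, is_correct) pairs.
    Assumption: span counts are pooled over all traces in each group.
    """
    counts = {True: Counter(), False: Counter()}
    for ops, is_correct in traces:
        counts[is_correct].update(ops)
    freq = {}
    for is_correct, c in counts.items():
        total = sum(c.values())
        freq[is_correct] = {op: 100.0 * n / total for op, n in c.items()} if total else {}
    return freq

# Toy data: one correct and one incorrect trace.
traces = [
    (["Grounding", "Inferring", "Inferring", "Inferring"], True),
    (["Hypothesizing", "Backtracking", "Inferring", "Hypothesizing"], False),
]
freq = operator_frequency(traces)
```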
## Appendix G Temporal Operator Profiles
Figure [8](https://arxiv.org/html/2605.29192#A7.F8) shows operator density as a function of normalized trace position\. Grounding dominates the early trace; Inferring concentrates near the end\. Backtracking shows a midtrace spike in incorrect traces, a signature of failed exploration not followed by recovery\.
Figure 8: Operator density as a function of normalized trace position \(Gaussian KDE\)\. Dashed reference = uniform \(1/7\)\. Correct: solid; incorrect: dashed\.
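One way to obtain a position-dependent operator profile that is comparable to the uniform $1/7$ reference is to smooth span positions with a Gaussian kernel and report each operator's share of the kernel mass at a given position. The sketch below does this for a single trace. The normalized position $(i + 0.5)/N$, the bandwidth of 0.1, and the share normalization are all assumptions, since the text only states that a Gaussian KDE is used.

```python
import math

def kernel_mass(points, x, bandwidth):
    """Sum of unnormalized Gaussian kernels centered at points, evaluated at x."""
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)

def operator_share(ops, target, x, bandwidth=0.1):
    """Kernel-smoothed share of spans labeled target near normalized position x."""
    n = len(ops)
    positions = [(i + 0.5) / n for i in range(n)]  # assumed position convention
    target_pos = [p for p, op in zip(positions, ops) if op == target]
    return kernel_mass(target_pos, x, bandwidth) / kernel_mass(positions, x, bandwidth)

# Toy trace: Grounding early, Inferring late.
ops = ["Grounding", "Grounding", "Hypothesizing",
       "Backtracking", "Inferring", "Inferring"]
early_grounding = operator_share(ops, "Grounding", 0.1)
late_grounding = operator_share(ops, "Grounding", 0.9)
```

At any position the shares of all operators sum to one, so a value above $1/7$ means the operator is over-represented at that point of the trace relative to a uniform mix.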
## Appendix H Backtracking analysis details
We classify each Backtracking and Hypothesizing event on GPQA as Local \(re\-checks a single calculation; method unchanged\), Sub\-Problem \(re\-opens a specific case or branch\), or Global \(abandons the current solution strategy\), using Claude Sonnet 4\.6 as a judge\.
#### Sample\.
Events are drawn from GPQA traces with stratification over \(model, correctness\), yielding $n = 1{,}294$ Backtracking and $n = 192$ Hypothesizing events\.
#### Prompt\.
The judge sees the GPQA problem and the full reasoning trace with the target event marked: a contiguous region from the target operator span to the next span of the same operator class \(or trace end\)\. The system rubric is reproduced in the prompt below; the judge returns a JSON object `{"scope": ..., "rationale": ...}`\.
Table 7: Scope classification for Backtracking and Hypothesizing events on GPQA\.

Prompt: Backtracking analysis judge prompt

You are analyzing reasoning traces from large language models solving GPQA \(graduate\-level science\) questions\. Your task is to classify the SCOPE of a single REVISION\-ADJACENT region marked in the trace: a contiguous section where the model re\-checks, considers an alternative, hypothesizes, qualifies, or hesitates\.

The marked region \(between `>>>MARKER<<<` and `<<<END>>>` markers\) starts at the target operator event and ends just before the next event of the SAME operator type \(or at the end of the trace\)\. Read the surrounding context to decide what the model is actually re\-thinking inside the marked region\.

CLASSIFY THE SCOPE:

- LOCAL: The model is re\-checking a single calculation, fact lookup, or specific claim \(e\.g\., “wait, $3 \times 7 = 21$ not $18$”; “actually that should be oxygen”; “let me recompute that integral”; “I think this is the methyl group”\)\. The OVERALL APPROACH is unchanged: same method, same plan, just verifying or fixing one step\. Includes self\-confirmation moves and tentative single\-fact hypotheses\.
- SUB\_PROBLEM: The model is re\-opening or alternating among specific cases, branches, or sub\-questions within the larger problem \(e\.g\., “let me reconsider case 2”; “what about the limit where $x \to 0$?”; “could this be the \(R\) instead of \(S\) configuration?”\)\. Strategy preserved but a discrete piece is redone or swapped\.
- GLOBAL: The model abandons or proposes to replace its current solution strategy/method with a fundamentally different approach \(e\.g\., “this Lagrangian approach isn’t working, let me try energy conservation”; “I shouldn’t use the ideal\-gas assumption; let me redo with van der Waals”; “I had the wrong mechanism entirely”; “let me use Gauss’s law instead of Coulomb”\)\. The high\-level plan changes\.

Output format: a single JSON object on one line: `{"scope": "LOCAL"|"SUB_PROBLEM"|"GLOBAL", "rationale": "<one short sentence>"}`

Be decisive\. Compare what the model is doing in the spans IMMEDIATELY before vs\. after the marked region; that local context tells you what is actually changing\. Judge each case on its merits without a default fallback category\.

System prompt used to classify the scope of Backtracking and Hypothesizing spans in GPQA reasoning traces in backtracking analysis\.
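The two mechanical steps around the judge call, locating the marked region and validating the returned JSON, can be sketched as follows. The function names and the toy operator sequence are hypothetical, and the call to the judge model itself is omitted because its details are not given in the text.

```python
import json

VALID_SCOPES = {"LOCAL", "SUB_PROBLEM", "GLOBAL"}

def marked_region(ops, target_idx):
    """Return (start, end) span indices with end exclusive.

    The region runs from the target span up to, but not including, the
    next span with the same operator label, or to the end of the trace.
    """
    target_op = ops[target_idx]
    for j in range(target_idx + 1, len(ops)):
        if ops[j] == target_op:
            return target_idx, j
    return target_idx, len(ops)

def parse_judge_reply(reply):
    """Parse the one-line JSON reply and reject scopes outside the rubric."""
    obj = json.loads(reply)
    scope = obj.get("scope")
    if scope not in VALID_SCOPES:
        raise ValueError(f"unexpected scope: {scope!r}")
    return scope, str(obj.get("rationale", ""))

# Toy operator sequence and a made-up judge reply.
ops = ["Grounding", "Backtracking", "Inferring",
       "Hypothesizing", "Backtracking", "Inferring"]
region = marked_region(ops, 1)
scope, rationale = parse_judge_reply(
    '{"scope": "LOCAL", "rationale": "Rechecks one product."}'
)
```

Rejecting unknown scope values up front keeps malformed judge outputs from being silently counted in one of the three categories.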
## Appendix I Per\-Dataset Baseline Comparison
Figure [9](https://arxiv.org/html/2605.29192#A9.F9) shows WP\-AUC \(cross\-dataset protocol\) for all six methods on each of the 6 benchmarks\. BIG\-Bench Hard and HumanEval are excluded \(see §[6\.2](https://arxiv.org/html/2605.29192#S6.SS2) and Appendix [B](https://arxiv.org/html/2605.29192#A2)\)\. Op\-XGB and OST dominate on math\-heavy and code benchmarks; the LLM self\-judge is near chance on all datasets\.
Figure 9: Per\-benchmark WP\-AUC \(cross\-dataset CV\) for all six methods\. Gray bars: span\-free baselines \(Length, Backtrack, Wait\)\. Red bar: LLM self\-judge\. Green bar: Op\-XGB \(operator n\-gram counts \+ anchor\-phrase TF\-IDF; XGBoost\)\. Blue bar: OST \(Operator Sequence Transformer\)\.
## Appendix J Labeled Trace Example
Table [8](https://arxiv.org/html/2605.29192#A10.T8) shows a representative labeled trace from R1\-Distill\-LLaMA\-70B on a MATH intermediate algebra problem \(incorrect\)\. Each span begins at a pivot sentence \(anchor, upright font\) followed by continuation sentences in italics\.
Table 8: Labeled trace with full span content\. Model: R1\-Distill\-LLaMA\-70B\. Dataset: MATH\. Correct: False\.
## Appendix K Evaluation
### K\.1 Stability Analysis
#### Methodology\.
To assess whether the induced operator labels reflect stable semantic structure rather than arbitrary exemplar choice, we held the $K=7$ cluster assignments fixed and re\-ran the LLM naming step \(claude\-sonnet\-4\-6, $T=0$; same prompt as the discovery pipeline: top\-12 anchor $n$\-grams plus 5 example spans per cluster\) across 30 exemplar seeds\. The top $n$\-grams are deterministic per cluster; only the 5 exemplars vary by seed, isolating naming stability from the upstream clustering\. Each of the $30 \times 7 = 210$ resulting descriptions was embedded once with `text-embedding-3-small` \(1536\-d, $L_2$\-normalized\), and we computed pairwise cosine similarity as the inner product\. We then formed three pair distributions: a *within\-real* distribution \(all $\binom{30}{2} = 435$ cross\-seed pairs per cluster, $3{,}045$ pairs pooled across the seven operators\); an *across\-real baseline* of $2{,}000$ random pairs $(i,j)$ with $\mathrm{cluster}(i) \neq \mathrm{cluster}(j)$ from the same description pool; and a *random\-group control* generated by the identical naming protocol applied to seven disjoint “fake clusters” of $2{,}000$ spans each, drawn uniformly from the full type\-A span pool\. These fake clusters mix every real operator in corpus\-rate proportions, so their top $n$\-grams concentrate on high\-frequency meta\-phrases \(“now we need”, “let me think”\) with no operator\-specific signal\.
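The pair counts and the cosine computation can be verified with a short standard-library sketch. The toy two-dimensional vectors stand in for the 1536-d description embeddings, and the helper names are illustrative.

```python
import math
from itertools import combinations

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def inner(u, v):
    """Inner product; for unit vectors this equals the cosine similarity."""
    return sum(a * b for a, b in zip(u, v))

def within_cluster_cosines(embeddings):
    """Cosine similarity for every unordered pair of descriptions in one cluster."""
    unit = [l2_normalize(e) for e in embeddings]
    return [inner(a, b) for a, b in combinations(unit, 2)]

n_seeds, n_clusters = 30, 7
pairs_per_cluster = math.comb(n_seeds, 2)       # cross-seed pairs per cluster
pooled_pairs = pairs_per_cluster * n_clusters   # pooled over the seven operators

# Toy embeddings for three descriptions of one cluster.
cosines = within_cluster_cosines([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```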
#### Statistical comparisons\.
We test whether the within\-real distribution stochastically dominates each of the two reference distributions using a one\-sided Mann–Whitney $U$ test \(alternative $=$ greater\)\. For interpretability we report the rank\-biserial effect size $r_{\mathrm{rb}} = 2U/(n_1 n_2) - 1$, which rescales $U$ to $[-1, +1]$ and equals $2\,P(A>B) - 1$: with sample sizes in the thousands the $U$\-test rejects for trivial gaps, so $r_{\mathrm{rb}}$ is the load\-bearing summary of *how much* the distributions are separated\. The 435 within\-cluster pair cosines per cluster are not statistically independent \(they share the 30 underlying seed embeddings\), which makes the $U$\-test $p$\-values anticonservative; $r_{\mathrm{rb}}$ does not depend on this independence assumption and is therefore reported as the primary result\.
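The effect size follows directly from the $U$ statistic, as the brute-force sketch below shows. It counts pairs explicitly, which is quadratic and meant only for illustration, and it counts ties as one half, a common convention that the text does not spell out.

```python
def mann_whitney_u(a, b):
    """U statistic of sample a against b: pairs with a_i > b_j, ties count one half."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

def rank_biserial(a, b):
    """r_rb = 2U / (n1 * n2) - 1, which lies in [-1, +1]."""
    return 2.0 * mann_whitney_u(a, b) / (len(a) * len(b)) - 1.0

def prob_superiority(r_rb):
    """Invert r_rb = 2 P(A > B) - 1 to recover P(A > B)."""
    return (r_rb + 1.0) / 2.0

# Toy cosine samples: a plays the within-cluster role, b the reference role.
a = [0.70, 0.60, 0.80]
b = [0.50, 0.65, 0.40]
r_rb = rank_biserial(a, b)   # 8 of the 9 pairs have a_i > b_j
```

Under this mapping an effect size of $+0.72$ corresponds to $P(A>B) = 0.86$ and $+0.39$ to about $0.70$, in line with the percentages quoted in the results.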
#### Results and interpretation\.
Figure [10](https://arxiv.org/html/2605.29192#A11.F10) shows the three cosine\-similarity distributions\. The within\-real pooled median cosine is $0.660$ versus $0.467$ for the across\-real baseline and $0.565$ for the random\-group control\. Real\-cluster names dominate the across\-cluster baseline with $r_{\mathrm{rb}} = +0.72$ and the random\-group control with $r_{\mathrm{rb}} = +0.39$ \(one\-sided $U$\-test $p \approx 0$ and $p < 10^{-149}$, respectively\): roughly $86\%$ of within\-real pairs are more similar than a random across\-cluster pair, and $69\%$ more similar than a random within\-fake\-group pair\. Every individual operator clears the across\-cluster baseline \(per\-cluster $U$\-test $p < 10^{-12}$; $r_{\mathrm{rb}}$ ranges from $+0.22$ for Grounding, the weakest, to $+1.00$ for Backtracking, the strongest\)\. The random\-group control falls midway between within\-real and across\-cluster \($r_{\mathrm{rb}} = +0.44$ for random vs\. across; median gap $= 0.098$\): LLM descriptions of mixed\-operator span pools are somewhat self\-consistent, but far below the within\-real ceiling, confirming that the coherence of real operator names is content\-driven rather than a generic tendency of the naming prompt to produce similar\-sounding sentences\.
Figure 10: Naming stability: pairwise cosine similarity distributions between cluster\-description embeddings generated under 30 exemplar seeds \(`text-embedding-3-small`, $L_2$\-normalized\)\. The *within\-cluster* distribution \(median $= 0.660$\) is well separated from both the *across\-cluster* baseline \(median $= 0.467$\) and the *random\-group control* \(median $= 0.565$\), which applies the same naming protocol to seven disjoint random span groupings\. The gap confirms that operator names are stable at the level of meaning and are not an artifact of the LLM prompt\.
## Appendix L Additional Analyses
### L\.1 Transition Structure and Run\-Length Statistics
Figure [11](https://arxiv.org/html/2605.29192#A12.F11) shows the operator transition probability matrix and run\-length statistics\. Grounding has the longest self\-runs, reflecting sustained fact\-anchoring that persists for multiple consecutive spans\. Backtracking is nearly always a single span: it interrupts but does not persist, which is consistent with its function as a brief error signal rather than a sustained strategy change\. The Hypothesizing $\to$ Inferring pathway has elevated probability relative to most other off\-diagonal entries, consistent with a generate\-and\-test schema in which tentative scenarios are followed by conclusions\.
Figure 11: Left: operator transition probability matrix\. Entry $(i,j)$ = probability that operator $j$ follows $i$\. Right: run\-length statistics per operator: mean and distribution of consecutive same\-operator span counts\. Grounding has the longest self\-runs \(sustained fact\-anchoring\); Backtracking is nearly always a single span\.
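Both statistics can be derived from a trace's operator label sequence. The sketch below estimates first-order transition probabilities and run lengths for a single toy trace; pooling over a corpus would accumulate the same counts across traces. The function names and the example sequence are illustrative.

```python
from collections import Counter, defaultdict
from itertools import groupby

def transition_matrix(ops):
    """Estimate P(next operator = j | current operator = i) from adjacent spans."""
    counts = defaultdict(Counter)
    for current, nxt in zip(ops, ops[1:]):
        counts[current][nxt] += 1
    return {
        i: {j: n / sum(row.values()) for j, n in row.items()}
        for i, row in counts.items()
    }

def run_lengths(ops):
    """Collect lengths of maximal runs of consecutive identical operators."""
    runs = defaultdict(list)
    for op, group in groupby(ops):
        runs[op].append(sum(1 for _ in group))
    return dict(runs)

# Toy trace: a long Grounding run, then two generate-and-test steps.
ops = ["Grounding", "Grounding", "Grounding", "Hypothesizing", "Inferring",
       "Backtracking", "Hypothesizing", "Inferring"]
matrix = transition_matrix(ops)
runs = run_lengths(ops)
```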
### L\.2 Model Identification Confusion Matrix
Figure [12](https://arxiv.org/html/2605.29192#A12.F12) shows the confusion matrix for the 12\-way model identification task under problem\-level 5\-fold CV using the full Op\-XGB feature recipe \(Appendix [E](https://arxiv.org/html/2605.29192#A5)\)\. The classifier reaches accuracy $= 79.9\%$ and macro\-AUC $= 0.987$\. Errors concentrate within model families\. The R1\-Distill\-Llama\-70B / R1\-Distill\-Qwen\-32B pair contributes the largest off\-diagonal mass: R1\-Distill\-Llama\-70B traces are predicted as R1\-Distill\-Qwen\-32B at $0.41$ and the reverse at $0.29$, consistent with their shared distillation lineage from DeepSeek\-R1\. The Claude\-Haiku\-4\.5 / Claude\-Sonnet\-4\.5 pair shows mutual confusion in the $0.21$–$0.26$ range, and Kimi\-K2\.5 is predicted as Kimi\-K2 at $0.10$ \(with the reverse at $0.05$\)\. Off\-diagonal mass outside these family clusters is small\. Per\-model AUCs span $0.967$ \(R1\-Distill\-Llama\-70B\) to $0.997$ \(Kimi\-K2, Kimi\-K2\.5, Grok\-3\-Mini, all tied at this rounding\)\.
Figure 12: Model identification confusion matrix \(row\-normalized\) under problem\-level 5\-fold CV with the full Op\-XGB feature recipe \(Appendix [E](https://arxiv.org/html/2605.29192#A5)\)\. Accuracy = $0.799$, macro\-AUC = $0.987$\. Errors concentrate within model families \(R1\-Distill pair, Claude pair, Kimi pair\)\.
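For readers who want to reproduce a row-normalized confusion matrix from predictions, a minimal sketch follows. The three model labels and the predictions are placeholders, not results from the paper.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Raw counts with true labels on rows and predicted labels on columns."""
    index = {lab: k for k, lab in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def row_normalize(matrix):
    """Divide each row by its sum so entry (i, j) estimates P(predicted j | true i)."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([c / total if total else 0.0 for c in row])
    return out

def accuracy(matrix):
    """Fraction of all examples that lie on the diagonal."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)

# Placeholder labels and predictions for three hypothetical models.
labels = ["model_a", "model_b", "model_c"]
y_true = ["model_a", "model_a", "model_b", "model_b", "model_c", "model_c"]
y_pred = ["model_a", "model_b", "model_b", "model_b", "model_c", "model_a"]
counts = confusion_matrix(y_true, y_pred, labels)
rates = row_normalize(counts)
```

With row normalization, the off-diagonal entry for a pair of models reads as the fraction of the row model's traces attributed to the column model, which is how the within-family confusion rates above are expressed.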
## Appendix References
- \[1\] Qwen Team\. QwQ\-32B: embracing the power of reinforcement learning\. [https://qwenlm\.github\.io/blog/qwq\-32b/](https://qwenlm.github.io/blog/qwq-32b/), 2025\. Blog post, March 6, 2025\.
- \[2\] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al\. Qwen3 technical report\. arXiv preprint arXiv:2505\.09388, 2025\.
- \[3\] DeepSeek\-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, et al\. DeepSeek\-R1: incentivizing reasoning capability in LLMs via reinforcement learning\. arXiv preprint arXiv:2501\.12948, 2025\.
- \[4\] Anthropic\. Introducing Claude 4\. [https://www\.anthropic\.com/news/claude\-4](https://www.anthropic.com/news/claude-4), 2025\. Blog post, May 22, 2025\.
- \[5\] xAI\. Grok 3 beta: the age of reasoning agents\. [https://x\.ai/news/grok\-3](https://x.ai/news/grok-3), 2025\. Blog post, February 19, 2025\.
- \[6\] Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El\-Yaniv, Omri Puny, et al\. Llama\-nemotron: efficient reasoning models\. arXiv preprint arXiv:2505\.00949, 2025\.
- \[7\] Kimi Team\. Kimi k2: open agentic intelligence\. arXiv preprint arXiv:2507\.20534, 2025\.
- \[8\] Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S\. H\. Cai, Yuan Cao, Y\. Charles, H\. S\. Che, Cheng Chen, Guanduo Chen, et al\. Kimi k2\.5: visual agentic intelligence\. arXiv preprint arXiv:2602\.02276, 2026\.