Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts
Summary
This research paper from MediaTek and National Taiwan University challenges the assumption that reasoning chains must be dense and sequential, showing that models can extract answers from sparse, shuffled, and noisy reasoning traces. The findings suggest that answer extraction is robust and order-independent, potentially enabling more efficient, parallelized reasoning generation.
# Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts
Source: [https://arxiv.org/html/2605.07307](https://arxiv.org/html/2605.07307)
Yi-Chang Chen¹, Feng-Ting Liao¹, Da-shan Shiu¹, Hung-yi Lee²
¹MediaTek Research ²Artificial Intelligence Center of Research Excellence, National Taiwan University
###### Abstract
Modern reasoning language models generate dense, sequential chain-of-thought traces, implicitly assuming that every token contributes and that steps must be consumed in order. We challenge both assumptions through a systematic intervention pipeline (removal, masking, shuffling, and noise injection) applied to model-generated reasoning chains across three models of different sizes and from different model families, evaluated on three challenging benchmarks spanning distinct domains. Our findings are counterintuitive on three dimensions. *Order*: Does the sequential order of a reasoning chain matter for answer extraction? No: line-level shuffling reduces accuracy by less than 0.5 percentage points (pp); word-level shuffling retains 62%–89% accuracy; only token-level shuffling collapses to near zero. Crucially, pretrained-only and instruction-tuned variants of the same model exhibit near-identical tolerance to both perturbations (78.67% vs. 78.00% under line shuffling), indicating this order-independence originates from pretraining rather than reasoning-specific fine-tuning. *Density*: Is all the information in a reasoning chain important for answer extraction? No: in our mathematical reasoning experiments, only symbolic and numeric content (including the explicit answer occurrence) is irreducible. Masking numeric digits collapses accuracy to exactly 0%, while masking alphabetic prose even improves accuracy by 4.7 pp. *Robustness*: Is a reasoning chain that is both order-shuffled and sparse still usable? Yes: the most aggressively reduced representation (all natural language removed, lines arbitrarily shuffled) still achieves 83% accuracy, and injecting false answers at 3× the true-answer frequency leaves accuracy entirely unchanged (83.3% → 83.3%), falsifying a frequency-based extraction account.
Together, these results establish that answer extraction operates on a sparse, order-shuffled, and structurally robust informational substrate, opening paths toward parallelized and token-efficient reasoning generation. All code and data are publicly available.¹

¹ [https://github.com/mtkresearch/reasoning-behavior](https://github.com/mtkresearch/reasoning-behavior)
## 1Introduction
Large reasoning language models, such as OpenAI's o1 ([OpenAI, 2024](https://arxiv.org/html/2605.07307#bib.bib3)) and DeepSeek-R1 ([DeepSeek-AI, 2025](https://arxiv.org/html/2605.07307#bib.bib4)), achieve strong performance on challenging multi-step tasks by generating extended chain-of-thought (CoT) traces ([Wei et al., 2022](https://arxiv.org/html/2605.07307#bib.bib1)) before producing a final answer. At inference time, the answer-generation model conditions on this entire trace to extract the final response.
This paradigm rests on two implicit assumptions. The first is *density*: every token in the reasoning trace contributes meaningfully to the final answer. The second is *order*: reasoning steps must be produced and consumed sequentially. Together, these assumptions make reasoning inherently expensive: inference cost grows linearly with problem difficulty, and auto-regressive dependency prevents parallelization.
Neither assumption has been empirically validated for the answer-*extraction* stage. If extraction depends only on a sparse, unordered subset of the trace's content, then the extraction stage imposes neither an ordering constraint nor a completeness constraint on what the generation stage produces, and the two stages could in principle be decoupled. Such decoupling would liberate generation from its current constraints: if the extraction stage can tolerate order-shuffled reasoning chains, reasoning steps need not be produced auto-regressively and could instead be generated in parallel; if the extraction stage can tolerate sparse reasoning chains, intermediate tokens need not be produced exhaustively and could instead be selectively omitted. We therefore ask: *what is the actual information structure a model relies upon when extracting an answer from a reasoning trace?*
To answer this empirically, we apply controlled transformations (removal of alphabetic characters and question text; masking of alphabetic text, numeric digits, and answer occurrences; shuffling of tokens, words, lines, and within-line words; and noise injection) to model-generated reasoning chains and measure the resulting change in answer accuracy across three models from different organizations and parameter scales (GPT-OSS-120B, DeepSeek-V3.1-671B, OLMo-3.1-32B) and three benchmarks covering three distinct reasoning tasks: mathematical computation (AIME 2025), algorithmic problem solving (CodeElo), and scientific inference (GPQA-Diamond). This verifies that our observations are not attributable to a single model's characteristics or a single domain.
Our findings are counterintuitive. On order shuffling: randomly permuting the *lines* of a reasoning chain reduces accuracy by less than 0.5 percentage points (pp) across all models and all three domains; word-level shuffling retains 62%–89% accuracy on AIME 2025 while token-level shuffling collapses to near zero, establishing word-level semantic identity as the minimum necessary granularity for extraction. Crucially, we found that pretrained-only and instruction-tuned variants of the same model exhibit near-identical tolerance to both line-level and word-level shuffling, indicating this property is rooted in the transformer's pretraining regime rather than reasoning-specific fine-tuning. On sparsity: removing all alphabetic characters from a mathematical reasoning chain costs only 1.3 pp of accuracy; masking alphabetic text actually *improves* accuracy by 4.7 pp, indicating natural-language prose is not the key element for answer extraction in this setting; a chain stripped of all natural language and shuffled into arbitrary line order still achieves 83% accuracy. On robustness: injecting false-answer sentences at three times the true-answer frequency leaves accuracy entirely unchanged (91.3% → 91.3%). These results establish that answer extraction is *sparse*, *tolerant to order shuffling*, and *structurally robust*, directly contradicting both assumptions of the current paradigm.
Taken together, these findings suggest that transformers process reasoning chains more like *unordered sets of symbolic constraints* than sequential proofs, attending selectively to numeric relational structures while largely ignoring prose and positional arrangement. This has direct practical implications: if extraction tolerates both reordering and content sparsity, the generation stage could in principle be freed from its current auto-regressive, exhaustive constraints; reasoning steps could be generated in parallel, and uninformative prose tokens could be selectively omitted.
This work makes the following contributions:
- **A systematic empirical framework for probing answer extraction.** A suite of controlled transformations (removal, masking, shuffling, noise injection) across three models (GPT-OSS-120B, DeepSeek-V3.1-671B, OLMo-3.1-32B) and three domains (AIME 2025, CodeElo, GPQA-Diamond), enabling reproducible, fine-grained isolation of which information the extraction stage actually uses.
- **Reasoning chains tolerate order shuffling at the line level; word-level identity is the minimum necessary granularity.** Line-order destruction causes negligible accuracy loss (<0.5 pp); word-level semantic identity is the minimum necessary unit for extraction, suggesting reasoning steps need not be presented, or perhaps generated, sequentially.
- **Order-independence is present in pretrained models.** Pretrained and instruction-tuned variants show near-identical shuffling tolerance, ruling out reasoning-specific fine-tuning as the source of this property.
- **Informational content is non-uniformly distributed: alphabetic prose is redundant while numeric content and the explicit answer are irreducible.** A chain stripped of all natural language and presented in arbitrary line order still achieves 83% accuracy.
- **The structural signal is intrinsically robust; even strong false-answer injection cannot override it.** Extraction is unaffected at 3× true-answer frequency, indicating the signal is governed by relational constraint satisfaction over numeric structures rather than surface-level frequency counting.
## 2Related Work
A body of prior work has examined what information in reasoning traces actually drives model performance. [Min et al. (2022)](https://arxiv.org/html/2605.07307#bib.bib24) show that the correctness of in-context demonstration labels contributes surprisingly little to task performance; the label space, input distribution, and sequence format matter far more than literal semantic content. [Madaan and Yazdanbakhsh (2022)](https://arxiv.org/html/2605.07307#bib.bib25) decompose CoT prompts into structural patterns and semantic content, finding that patterns drive performance while factual content is largely expendable; their Concise CoT achieves over 20% token reduction with minimal accuracy loss. [Lanham et al. (2023)](https://arxiv.org/html/2605.07307#bib.bib26) show through intervention experiments that model reliance on the reasoning trace is highly task-dependent, and that larger models tend to produce *less* faithful CoT. [Pfau et al. (2024)](https://arxiv.org/html/2605.07307#bib.bib27) demonstrate that transformers can achieve strong performance on demanding tasks even when intermediate tokens are meaningless fillers; the primary mechanism is the extra computational budget, not semantic content. Together, these results suggest that CoT reasoning chains contain substantial redundancy, and that computation space rather than semantic content may be the primary driver of performance.
Our work extends this line by directly intervening on model\-generated reasoning chains and quantifying information geometry along two orthogonal axes: sparsity and order\-dependence\. Unlike prior work that modifies human\-authored demonstrations or analyzes model behavior indirectly, we apply a configurable transformation pipeline to chains produced by frontier models and measure the resulting accuracy change across multiple models and domains\. Removing all natural\-language text from model\-generated reasoning chains costs less than 2 pp of accuracy, and reordering lines causes negligible degradation, suggesting that sequential structure is less critical than the presence of the right informational tokens anywhere in the context\.
## 3Methodology
### 3\.1Problem Formulation
A traditional reasoning model generates an answer *a* given a question *q* and a reasoning chain *r* = (r₁, r₂, …, rₙ):

P(a ∣ q, r)  (1)

Our central hypothesis is that answer generation can succeed with a *sparse and order-shuffled* representation r̃ = T(r; θ):

P(a ∣ q, r) ≈ P(a ∣ q, r̃)  (2)

where T controls the degree of sparsity or reordering applied to r. We operationalize this hypothesis by measuring answer accuracy under a suite of controlled transformations applied to model-generated reasoning chains, using an untransformed chain as the baseline.
Table 1: Transformation processors used in this work. We implement five categories: Shuffling (𝒮) permutes content at multiple granularities; Masking (ℳ) replaces targeted character classes with a mask token; Removal (ℛ) strips content entirely; Noise Injection (𝒩_k) inserts false-answer sentences at k× the true-answer frequency; Randomisation baselines (𝒟) replace tokens or words with random samples. Processors compose sequentially (e.g., 𝒮_line ∘ ℛ_α).

| Symbol | Name | Operation |
|---|---|---|
| 𝒮_tok | Token-Shuffle | Permute every subword token globally. |
| 𝒮_word | Word-Shuffle | Treat the entire chain as a flat bag of words and permute globally. |
| 𝒮_line | Line-Shuffle | Permute newline-delimited segments globally; within-line content is preserved. |
| 𝒮_ilw | In-line-Word-Shuffle | Independently permute words within each line; line order is preserved. |
| ℳ_α | Mask-Alphabet | Replace all alphabetic characters with ■. |
| ℳ_ν | Mask-Number | Replace all numeric digits with ■. |
| ℳ_ans | Mask-Answer | Replace all occurrences of the ground-truth answer with ■. |
| ℛ_r | Reasoning-Free | Omit the reasoning chain entirely from the prompt. |
| ℛ_α | Remove-Alphabet | Strip all alphabetic characters; retain digits, symbols, and whitespace. |
| ℛ_ans | Remove-Answer | Strip all occurrences of the ground-truth answer from the chain. |
| ℛ_q | Remove-Question | Omit the original question from the answer-generation prompt. |
| 𝒩_k | Noise-Injection (k×) | Insert the false-answer sentence at k× the frequency of the true answer. |
| 𝒟_tok | Random-Token | Replace each token with a uniformly sampled random token. |
| 𝒟_word | Random-Word | Replace each word with a word sampled according to the word-frequency distribution of the reasoning chain. |
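The processors in Table 1 are plain text-level operations. A minimal Python sketch of three of them, plus the sequential composition used throughout (e.g., 𝒮_line ∘ ℛ_α), might look as follows; the function names are illustrative, not the paper's actual API:

```python
import random
import re

MASK = "\u25A0"  # black-square mask token, as in Table 1

def shuffle_lines(chain: str, rng: random.Random) -> str:
    """S_line: permute newline-delimited segments; within-line content preserved."""
    lines = chain.splitlines()
    rng.shuffle(lines)
    return "\n".join(lines)

def mask_numbers(chain: str) -> str:
    """M_nu: replace every numeric digit with the mask token."""
    return re.sub(r"\d", MASK, chain)

def remove_alphabet(chain: str) -> str:
    """R_alpha: strip alphabetic characters; keep digits, symbols, whitespace."""
    return re.sub(r"[A-Za-z]", "", chain)

def compose(*processors):
    """Apply processors right-to-left, matching the S_line ∘ R_alpha notation."""
    def pipeline(chain: str) -> str:
        for p in reversed(processors):
            chain = p(chain)
        return chain
    return pipeline
```

For example, `compose(lambda c: shuffle_lines(c, random.Random(0)), remove_alphabet)` first strips alphabetic characters and then permutes the surviving lines.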
### 3\.2Experimental Setup
Our experimental pipeline proceeds in three stages\.
Stage 1 - Reasoning chain collection. Before any intervention, we run each model ([Section 3.4](https://arxiv.org/html/2605.07307#S3.SS4)) on each benchmark ([Section 3.3](https://arxiv.org/html/2605.07307#S3.SS3)) under its standard reasoning mode to collect model-generated reasoning chains; these untransformed chains serve as the substrate for all subsequent transformations.
Stage 2 - Transformation and measurement. To probe the hypothesis in [Section 3.1](https://arxiv.org/html/2605.07307#S3.SS1), we apply the processors listed in [Table 1](https://arxiv.org/html/2605.07307#S3.T1) to each collected chain and measure answer accuracy as Accuracy = #correct / #success. Processors compose sequentially (e.g., 𝒮_line ∘ ℛ_α), enabling systematic exploration of joint transformation effects. The correctness criterion for each benchmark is described in [Section 3.3](https://arxiv.org/html/2605.07307#S3.SS3).
Stage 3 - Evaluation. We evaluate under two modes. In Free Generation (Gen) mode, the model produces an answer without any structural constraint beyond the provided (transformed) reasoning chain. In Retrieval (Ret) mode, a task-specific completion prefix is appended ("Thus, the answer is" for AIME 2025 and GPQA-Diamond; "Thus, the code is\n\`\`\`cpp\n" for CodeElo), constraining the model to produce an immediate answer without further deliberation. We adopt Ret as the primary evaluation setting; the empirical motivation for this choice is developed in [Section 4.1](https://arxiv.org/html/2605.07307#S4.SS1), with Gen results included for reference.
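Concretely, the two evaluation modes differ only in whether the completion prefix is appended. A sketch of the prompt assembly, with the Ret prefixes quoted from the text above (the surrounding template layout is an assumption, not the paper's exact prompt):

```python
# Ret-mode completion prefixes are quoted from the paper;
# the template layout around them is an assumption.
RET_PREFIX = {
    "aime2025": "Thus, the answer is",
    "gpqa-diamond": "Thus, the answer is",
    "codeelo": "Thus, the code is\n```cpp\n",
}

def build_prompt(question: str, chain: str, benchmark: str, mode: str) -> str:
    """Assemble the answer-generation prompt for Gen or Ret mode."""
    base = f"Question:\n{question}\n\nReasoning:\n{chain}\n\n"
    if mode == "ret":
        return base + RET_PREFIX[benchmark]
    return base  # Gen mode: free continuation, no structural constraint
```

In Ret mode the model can only complete the prefix, so any accuracy it achieves must come from the (possibly transformed) chain rather than from fresh deliberation.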
Baseline comparison. The baseline applies no transformation: the original reasoning chain is provided verbatim, and each transformed result is compared against it. Baseline accuracies are reported in the Original+Ret row of [Table 2](https://arxiv.org/html/2605.07307#S4.T2).
Compute resources. This work involves no model training; all experiments consist solely of model inference. All inference calls are routed through the OpenRouter API ([https://openrouter.ai/](https://openrouter.ai/)), which provides unified access to the hosted models used in our study.
### 3\.3Datasets
We evaluate primarily on AIME 2025 ([Zhang and Math-AI, 2025](https://arxiv.org/html/2605.07307#bib.bib33)), comprising 30 high-difficulty competition problems spanning algebra, geometry, combinatorics, and number theory. All answers are integers in the range 0–999. We sample 10 independent reasoning traces per problem (300 total instances) to reduce variance. Correctness is judged by a GPT-OSS-120B judge that checks numerical equivalence between the extracted answer and the ground truth.
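The paper delegates AIME correctness to a GPT-OSS-120B judge. Since all AIME answers are integers in 0–999, a purely programmatic stand-in for the numeric-equivalence check (a simplification for illustration, not the paper's implementation) is straightforward:

```python
import re

def aime_correct(response: str, ground_truth: int) -> bool:
    """Simplified stand-in for the paper's GPT-OSS-120B judge:
    take the last integer mentioned in the response and compare it
    with the ground truth (AIME answers are integers in 0-999)."""
    numbers = re.findall(r"\d+", response.replace(",", ""))
    return bool(numbers) and int(numbers[-1]) == ground_truth
```

A judge model is more forgiving than this regex (it can handle spelled-out numbers or restated reasoning), which is presumably why the paper uses one.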
CodeElo ([Quan et al., 2025](https://arxiv.org/html/2605.07307#bib.bib29)) is a competitive programming benchmark drawn from Codeforces, covering problems across a wide range of difficulty ratings. Each problem requires producing executable code that satisfies a formal specification; correctness is determined by execution against a fixed test suite, with no judge model required.
GPQA-Diamond ([Rein et al., 2023](https://arxiv.org/html/2605.07307#bib.bib28)) is a graduate-level science benchmark comprising expert-curated multiple-choice questions (four options: A/B/C/D) spanning biology, chemistry, and physics. Questions are designed to resist lookup strategies and require multi-step scientific reasoning; even domain experts achieve only around 65% accuracy, making it a stringent test of deep reasoning. Correctness is judged by a GPT-OSS-120B judge that identifies the selected option from the model's response.
The three benchmarks cover three qualitatively distinct reasoning modalities—mathematical computation, algorithmic problem solving, and scientific inference—enabling us to assess the generalizability of our findings across domains\.
### 3\.4Models
We use three models to collect reasoning chains ([Section 3.2](https://arxiv.org/html/2605.07307#S3.SS2)): OLMo-3.1-32B ([Olmo et al., 2025](https://arxiv.org/html/2605.07307#bib.bib32)), GPT-OSS-120B ([Agarwal et al., 2025](https://arxiv.org/html/2605.07307#bib.bib30)), and DeepSeek-V3.1-671B ([DeepSeek-AI, 2024](https://arxiv.org/html/2605.07307#bib.bib31)). Each model serves in two roles: as the reasoning generator (producing chains via its built-in reasoning mode) and as the answer extractor (consuming the transformed chain to produce a final answer). This paired design, in which each model generates and extracts from its own chains, isolates the effect of the transformation itself rather than any generator–extractor style mismatch. We additionally compare OLMo-3.1-32B-Base (pretrained only, without instruction tuning) against OLMo-3.1-32B to probe whether order-independence originates from reasoning-specific fine-tuning or from pretraining. All answer generation uses temperature 0.5 and a maximum of 5,000 output tokens.
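All inference runs through OpenRouter's OpenAI-compatible chat-completions endpoint. A sketch of how one extraction call might be assembled; the model slug shown is an illustrative assumption, while temperature 0.5 and the 5,000-token cap are the paper's stated settings:

```python
import json
import urllib.request

API_URL = "https://openrouter.ai/api/v1/chat/completions"

def extraction_request(prompt: str, model: str, api_key: str) -> urllib.request.Request:
    """Build one answer-extraction call. Decoding settings follow the paper."""
    payload = {
        "model": model,  # e.g. "openai/gpt-oss-120b" (illustrative slug)
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.5,  # paper's setting
        "max_tokens": 5000,  # paper's output cap
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

The request would then be dispatched with `urllib.request.urlopen` and the answer read from `choices[0].message.content` of the OpenAI-style JSON response.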
## 4Results and Discussion
We report seven empirical findings organized around four questions: (1) which evaluation mode cleanly isolates answer extraction from independent reasoning ([Section 4.1](https://arxiv.org/html/2605.07307#S4.SS1)); (2) whether extraction requires the reasoning chain to be presented in its original sequential order ([Section 4.2](https://arxiv.org/html/2605.07307#S4.SS2)); (3) whether informational content is uniformly distributed across the chain or concentrated in sparse symbolic structures ([Section 4.3](https://arxiv.org/html/2605.07307#S4.SS3)); and (4) how robust the identified extraction signal is to adversarial noise injection ([Section 4.4](https://arxiv.org/html/2605.07307#S4.SS4)). All accuracy values are reported under the retrieval-mode (Ret) setting unless otherwise noted; the motivation for this choice is developed in [Section 4.1](https://arxiv.org/html/2605.07307#S4.SS1).
### 4\.1Generation vs\. Retrieval Mode
We begin by motivating the choice of evaluation mode. Our framework supports two settings: free generation (Gen), in which the model freely continues its output after the reasoning chain, and retrieval mode (Ret), in which the model is constrained to complete a fixed answer-extraction prefix (e.g., "Thus, the answer is"). The distinction between these modes is consequential for interpreting any perturbation experiment.
#### Finding 1: Generation mode implicitly re-reasons when the chain is absent or degraded, making retrieval mode the appropriate setting for isolating extraction.
On the Original (untransformed) chain, Gen and Ret modes yield nearly identical accuracies across all three models and benchmarks ([Table 2](https://arxiv.org/html/2605.07307#S4.T2)), confirming that both modes are equally constrained by the chain's content when it is intact. The two modes diverge dramatically, however, under the Reasoning-Free condition (ℛ_r): OLMo-3.1-32B reaches 28.33% and DeepSeek-V3.1-671B reaches 46.33% on AIME 2025 in Gen mode *without any reasoning chain*, while both collapse to 0.00% in Ret mode. This reveals a critical confound: generation mode allows the model to substitute its own reasoning when the chain is absent or degraded, conflating extraction with the model's native problem-solving capacity and making any measured gain uninterpretable. We therefore adopt retrieval mode as the primary evaluation setting for all subsequent experiments, ensuring that all measured accuracy is attributable to the structure of the reasoning chain itself.
Table 2: Accuracy under five shuffling granularities, two randomisation baselines (𝒟_tok, 𝒟_word), and the reasoning-free condition (ℛ_r), across three models (OLMo, GPT-OSS, DeepSeek) and three benchmarks (AIME 2025, CodeElo, GPQA-Diamond). Two evaluation modes are compared: Gen (free generation) and Ret (retrieval with an answer-extraction prefix). Accuracies are single-run binomial estimates; SE ≤ ±2.9 pp for AIME (n = 300), ±2.5 pp for CodeElo (n = 408), and ±3.6 pp for GPQA-Diamond (n = 198).

Table 3: Accuracy of OLMo-Base (pretrained only) versus OLMo (instruction-tuned) under identical shuffling conditions across three benchmarks (Ret setting), probing whether order-independence originates from pretraining or reasoning-specific fine-tuning. SE bounds follow Table [2](https://arxiv.org/html/2605.07307#S4.T2).
### 4\.2Order Shuffling
With the evaluation mode established, we ask whether answer extraction requires the reasoning chain to be presented in its original sequential order. [Table 2](https://arxiv.org/html/2605.07307#S4.T2) reports accuracy under shuffling at five granularities across three models and three benchmarks.
#### Finding 2: Reasoning chains tolerate order shuffling at the line level, and shuffled chains provide genuine extractable signal.
Randomly permuting the *lines* of a reasoning chain (Line-Shuffle, 𝒮_line) destroys all inter-line ordering while preserving each line's content, yet causes negligible accuracy loss. On AIME 2025, OLMo-3.1-32B is unchanged at 78.00%, GPT-OSS-120B drops by only 0.39 pp (91.33% → 90.94%), and DeepSeek-V3.1-671B shows a slight *improvement*. On GPQA-Diamond, both GPT-OSS-120B and DeepSeek-V3.1-671B record zero accuracy change under complete line-order destruction. The sequential ordering of multi-step reasoning steps is almost entirely dispensable for answer extraction.
Crucially, this near-lossless shuffling is not because the shuffled chains are uninformative noise. Comparing against two lower bounds, the Reasoning-Free baseline (ℛ_r, which collapses to 0.00% for OLMo-3.1-32B and DeepSeek-V3.1-671B on AIME 2025) and the Random-Token baseline (𝒟_tok, reaching 0.00%–12.54%), shuffled chains remain far more useful than an absent chain or purely random token sequences. Shuffled reasoning chains preserve genuine extractable signal; the sequential line order, by contrast, carries almost none of the information needed for extraction.
#### Finding 3: Word\-level semantic units are the minimum necessary granularity; once this minimum meaning\-conveying unit is intact, ordering above it has limited impact on extraction\.
Finding 2 showed that line-level order is dispensable; Finding 3 probes how far this order-independence extends down the granularity hierarchy. The key question is: at what level of representation does ordering begin to matter? Cross-line word shuffling (𝒮_word, which scrambles all words across the entire chain) still yields substantially higher accuracy than the reasoning-free baseline: 77.67% for OLMo-3.1-32B, 62.21% for GPT-OSS-120B, and 89.67% for DeepSeek-V3.1-671B on AIME 2025. Word-level semantic content, even when globally disordered, continues to provide extractable signal, suggesting that once the minimum meaning-conveying unit is preserved, positional structure above that level exerts limited influence on extraction.
By contrast, token-level shuffling (𝒮_tok) is catastrophic: 1.33% for OLMo-3.1-32B and 3.33% for GPT-OSS-120B, approaching the reasoning-free floor. The distinction between word shuffling and vocabulary-level randomization further isolates the effect: the Random-Word baseline (𝒟_word), which replaces each word with a sample from the original word-frequency distribution, yields 0.00%–2.67% on AIME 2025, indistinguishable from no reasoning. Yet 𝒮_word, which merely reorders the original words, retains 62.21%–89.67%. The gap establishes that word-level semantic identity is the minimum necessary granularity: replacing words destroys all signal, while reordering them preserves most of it. Any ordering structure above the word level, within-line or across-line, appears largely dispensable for answer extraction, at least within the settings we examine.
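The decisive contrast in Finding 3 is between 𝒮_word, which permutes the chain's actual words, and 𝒟_word, which resamples words from the same unigram distribution. Both destroy order, and both preserve word-frequency statistics; only the first preserves the exact multiset of word identities. A sketch of the two operations (function names are illustrative):

```python
import random

def word_shuffle(chain: str, rng: random.Random) -> str:
    """S_word: treat the whole chain as a flat bag of words and permute it.
    The exact multiset of words survives; only their order is destroyed."""
    words = chain.split()
    rng.shuffle(words)
    return " ".join(words)

def random_word_baseline(chain: str, rng: random.Random) -> str:
    """D_word: draw each output word i.i.d. from the chain's own
    word-frequency distribution. Unigram statistics match in expectation,
    but the word identities at each slot are no longer the originals."""
    words = chain.split()
    return " ".join(rng.choices(words, k=len(words)))
```

The roughly 62%–90% accuracy under the first operation versus near 0% under the second is what pins extraction to word identity rather than word statistics.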
#### Finding 4: Order\-independence is present in pretrained models, suggesting it originates from pretraining rather than reasoning\-specific fine\-tuning\.
[Table 3](https://arxiv.org/html/2605.07307#S4.T3) compares OLMo-3.1-32B-Base (pretrained only) against OLMo-3.1-32B (instruction-tuned) under the same shuffling conditions. Both models achieve nearly identical accuracy under Line-Shuffle (78.67% vs. 78.00%) and Word-Shuffle (78.33% vs. 77.67%) on AIME 2025. Instruction tuning confers no additional advantage in handling disordered reasoning chains, and no additional penalty either. Order-independence is therefore not a product of chain-of-thought fine-tuning but is instead rooted in the transformer architecture's pretraining regime.
Together, these results show that answer extraction is largely order-independent above the word level, and that this property already exists in pretrained models.
Table 4: Accuracy under combinations of five interventions (question removal ℛ_q, alphabet masking ℳ_α, digit masking ℳ_ν, answer masking ℳ_ans, and line shuffling 𝒮_line) on AIME 2025 (Math) with GPT-OSS (Ret), probing which components of a reasoning chain are informational. ∙ indicates the intervention is applied; ∘ otherwise. Rows are grouped by the (ℛ_q, 𝒮_line) configuration: A = neither, B = 𝒮_line only, C = ℛ_q only, D = both; within each group, sub-IDs 1–5 sweep the remaining three masking interventions. ΔAcc is relative to the baseline row A1 (91.3%). Light gray: ΔAcc < −25%; dark gray: ΔAcc < −60%. SE bounds follow Table [2](https://arxiv.org/html/2605.07307#S4.T2).

| ID | ℛ_q | ℳ_α | ℳ_ν | ℳ_ans | 𝒮_line | Acc (%) | ΔAcc (%) |
|---|---|---|---|---|---|---|---|
| A1 | ∘ | ∘ | ∘ | ∘ | ∘ | 91.3 | — |
| A2 | ∘ | ∙ | ∘ | ∘ | ∘ | 96.0 | +4.7 |
| A3 | ∘ | ∘ | ∘ | ∙ | ∘ | 73.3 | −18.0 |
| A4 | ∘ | ∙ | ∘ | ∙ | ∘ | 68.1 | −23.2 |
| A5 | ∘ | ∘ | ∙ | ∙ | ∘ | 0.0 | −91.3 |
| B1 | ∘ | ∘ | ∘ | ∘ | ∙ | 90.9 | −0.4 |
| B2 | ∘ | ∙ | ∘ | ∘ | ∙ | 90.6 | −0.7 |
| B3 | ∘ | ∘ | ∘ | ∙ | ∙ | 71.3 | −20.0 |
| B4 | ∘ | ∙ | ∘ | ∙ | ∙ | 51.8 | −39.5 |
| B5 | ∘ | ∘ | ∙ | ∙ | ∙ | 0.0 | −91.3 |
| C1 | ∙ | ∘ | ∘ | ∘ | ∘ | 89.3 | −2.0 |
| C2 | ∙ | ∙ | ∘ | ∘ | ∘ | 92.8 | +1.5 |
| C3 | ∙ | ∘ | ∘ | ∙ | ∘ | 63.7 | −27.6 |
| C4 | ∙ | ∙ | ∘ | ∙ | ∘ | 18.8 | −72.5 |
| C5 | ∙ | ∘ | ∙ | ∙ | ∘ | 0.0 | −91.3 |
| D1 | ∙ | ∘ | ∘ | ∘ | ∙ | 91.3 | 0.0 |
| D2 | ∙ | ∙ | ∘ | ∘ | ∙ | 54.0 | −37.3 |
| D3 | ∙ | ∘ | ∘ | ∙ | ∙ | 65.67 | −25.6 |
| D4 | ∙ | ∙ | ∘ | ∙ | ∙ | 3.62 | −87.7 |
| D5 | ∙ | ∘ | ∙ | ∙ | ∙ | 0.0 | −91.3 |

Table 5: Accuracy when all alphabetic characters are removed (ℛ_α), combined with answer removal (ℛ_ans), line shuffling (𝒮_line), or word shuffling (𝒮_word), on AIME 2025 (Math) with GPT-OSS (Ret). ΔAcc is relative to the baseline (91.3%). Light gray: ΔAcc < −25%; dark gray: ΔAcc < −60%. Example chains under each transformation appear in the appendix table. SE bounds follow Table [2](https://arxiv.org/html/2605.07307#S4.T2).

Table 6: Accuracy under false-answer noise injection (𝒩_k) on AIME 2025 (Math) with GPT-OSS (Ret). Rows cross six representation conditions: three shuffle states (none, 𝒮_line, 𝒮_word) crossed with alphabet removal (ℛ_α) on/off; columns vary the noise multiplier from 0× to 3× the true-answer occurrence count. SE bounds follow Table [2](https://arxiv.org/html/2605.07307#S4.T2).
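The noise-injection condition 𝒩_k of Table 6 can be sketched as follows. The false-answer sentence template is an assumption (the paper does not print it here); the k× frequency matching follows the table caption:

```python
import random

def inject_false_answers(chain: str, true_answer: str, false_answer: str,
                         k: int, rng: random.Random) -> str:
    """N_k: insert a false-answer sentence k times per occurrence of the
    true answer, at random line positions. Sentence wording is hypothetical."""
    n_true = chain.count(true_answer)
    lines = chain.splitlines()
    noise = f"The answer is {false_answer}."  # assumed template
    for _ in range(k * n_true):
        lines.insert(rng.randrange(len(lines) + 1), noise)
    return "\n".join(lines)
```

At k = 3 the false answer outnumbers the true one three to one, so any extraction strategy based on surface frequency would flip to the false answer; the paper finds accuracy is unchanged.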
### 4\.3Information Sparsity
Having established that sequential order does not matter, we next ask which *parts* of the reasoning chain carry information. We investigate whether the chain's content is uniformly necessary or concentrated in specific symbolic structures.
#### Finding 5: Informational content is non\-uniformly distributed—alphabetic prose is redundant, the explicit answer occurrence is a strong extraction anchor, and numeric content is irreducible\.
[Table˜4](https://arxiv.org/html/2605.07307#S4.T4)reveals a striking asymmetry in the information structure of mathematical reasoning chains\. Masking all alphabetic characters \(ℳα\\mathcal\{M\}\_\{\\alpha\}, row A2\) does not reduce accuracy; it*increases*it by 4\.7 pp \(91\.3%→\\to96\.0%\), indicating that the model relies on numeric and symbolic structures rather than on natural\-language prose to locate the answer\.[Table˜5](https://arxiv.org/html/2605.07307#S4.T5)confirms this:*removing*all alphabetic characters \(ℛα\\mathcal\{R\}\_\{\\alpha\}\) costs only 1\.3 pp \(91\.3%→\\to90\.0%\) while substantially compressing the token count\.
The answer occurrence, by contrast, plays a decisive role. Masking only the answer token(s) (Mask-Answer, row A3) causes an 18.0 pp drop (91.3% → 73.3%), showing that the explicit chain-final answer functions as a primary extraction anchor. The model can partially recover the answer from the surrounding numeric context, but in 18.0 pp of cases the chain-final answer is the decisive cue.
Numeric content anchors the entire information hierarchy. Masking all numeric digits (ℳ_ν, rows A5/B5/C5/D5) collapses accuracy to 0.0% in every condition, confirming that numeric information is necessary and, together with the ℛ_α results above, sufficient for answer extraction in mathematical reasoning. In short: alphabetic text is redundant, the answer occurrence is a strong anchor, and numbers are irreducible.
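The paper does not specify how these character-class interventions are implemented; a minimal sketch (function names and the mask token are our own) treats them as regex substitutions:

```python
import re

def mask_alpha(chain: str, mask: str = "_") -> str:
    """M_alpha: replace every alphabetic character with a mask token."""
    return re.sub(r"[A-Za-z]", mask, chain)

def mask_digits(chain: str, mask: str = "_") -> str:
    """M_nu: replace every numeric digit with a mask token."""
    return re.sub(r"[0-9]", mask, chain)

def remove_alpha(chain: str) -> str:
    """R_alpha: delete alphabetic characters outright, compressing the chain."""
    return re.sub(r"[A-Za-z]", "", chain)

line = "Thus the required sum is 21+49 = 70."
print(mask_alpha(line))            # ____ ___ ________ ___ __ 21+49 = 70.
print(mask_digits(line))           # Thus the required sum is __+__ = __.
print(remove_alpha(line).strip())  # 21+49 = 70.
```

Under ℳ_ν the digits that carry the answer vanish, which is consistent with the observed collapse to 0%.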
#### Finding 6: The non\-uniformly distributed informational content is sufficient on its own—perturbing its positional arrangement causes only minimal signal loss\.
Finding 5 established a non-uniform distribution of informational content: alphabetic prose is redundant, while numeric content and the explicit answer occurrence are the load-bearing components. Finding 6 asks whether this sparse substrate remains sufficient once its positional structure is destroyed. Row B2 of [Table 4](https://arxiv.org/html/2605.07307#S4.T4) applies Mask-Alphabet and Line-Shuffle jointly: accuracy reaches 90.6%, only −0.7 pp from baseline. Stripping the prose and then permuting the remaining content causes virtually no additional loss, demonstrating that it is the *content*, not its arrangement, that supports extraction.
The Remove-Alphabet ablation ([Table 5](https://arxiv.org/html/2605.07307#S4.T5)) corroborates this under an even stronger perturbation. Deleting all alphabetic characters and then shuffling lines (𝒮_line ∘ ℛ_α, row 4 of [Table 5](https://arxiv.org/html/2605.07307#S4.T5)) yields 83.3%, only −8.0 pp from the unperturbed baseline. A reasoning chain stripped of all natural language and presented in arbitrary line order still yields the correct answer in 83% of cases. The question prompt is similarly dispensable: removing it (rows C1, D1 of [Table 4](https://arxiv.org/html/2605.07307#S4.T4)) costs at most 2.0 pp, and combining its removal with line shuffling (D1, 91.3%) incurs zero loss.
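As a concrete illustration, the composed transformation 𝒮_line ∘ ℛ_α can be sketched as follows (our own reconstruction; the paper does not state its line-splitting convention or random seed):

```python
import random
import re

def remove_alpha(chain: str) -> str:
    """R_alpha: strip all alphabetic characters, keeping digits and symbols."""
    return re.sub(r"[A-Za-z]", "", chain)

def shuffle_lines(chain: str, seed: int = 0) -> str:
    """S_line: permute the chain's lines uniformly at random."""
    lines = chain.splitlines()
    random.Random(seed).shuffle(lines)
    return "\n".join(lines)

chain = ("b + 7 divides 56.\n"
         "Divisors of 56: 1,2,4,7,8,14,28,56.\n"
         "Thus b = 21 or 49.")
sparse_shuffled = shuffle_lines(remove_alpha(chain), seed=42)
print(sparse_shuffled)  # a disordered numeric skeleton with no prose left
```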
Together, Findings 2–6 converge on a unified account: answer extraction is order-independent above the word level and is governed by the non-uniform informational content. In short, what matters is *which* symbolic content is present, not *where* it appears.
### 4.4 Noise Robustness
Finding 6 showed that the non-uniformly distributed informational content, the numeric computations and the explicit answer occurrence, forms a self-sufficient extraction substrate whose signal survives arbitrary reordering. A complementary question is whether this signal is also robust to adversarial contamination: when the true answer is still present but accompanied by an overwhelming number of competing false answers, can the model be misdirected away from it? To test this, we inject the spurious sentence “Thus answer: 123.” at 1×, 2×, and 3× the true-answer occurrence count (𝒩_{k=1}, 𝒩_{k=2}, 𝒩_{k=3}) under the representation conditions reported in [Table 6](https://arxiv.org/html/2605.07307#S4.T6).
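The injection itself can be sketched as follows (our reconstruction; insertion positions and substring-based counting of the true answer are assumptions):

```python
import random

def inject_false_answers(chain: str, true_answer: str, k: int,
                         noise: str = "Thus answer: 123.",
                         seed: int = 0) -> str:
    """N_k: insert the spurious sentence k times per true-answer occurrence."""
    lines = chain.splitlines()
    n_true = chain.count(true_answer)  # substring occurrences of the true answer
    rng = random.Random(seed)
    for _ in range(k * n_true):
        lines.insert(rng.randrange(len(lines) + 1), noise)
    return "\n".join(lines)

chain = "21+49 = 70.\nThus answer 70.\nThus sum = 70."
noisy = inject_false_answers(chain, "70", k=3)
print(noisy.count("123"))  # 9: three true occurrences, injected at 3x
```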
#### Finding 7: The structural signal identified in Finding 6 is intrinsically robust—even strong false\-answer injection cannot override it\.
Across every representation condition that preserves numeric content, accuracy stays essentially unchanged when false answers are injected at up to 3× the true-answer frequency. The unperturbed baseline holds at 91.3% across all noise levels (0× through 3×); the Remove-Alphabet chain remains near 90%, and the line-shuffled chain decays only mildly (90.9% → 86.3% at 3×). Most strikingly, even the most aggressively reduced representation from Finding 6, the line-shuffled Remove-Alphabet chain retaining only a disordered numeric skeleton, stays flat at 83.3% regardless of the noise multiplier.
This robustness stands in stark contrast to the model’s sensitivity to *removing* the true answer (Finding 5): the model is highly sensitive to the *absence* of the correct answer, yet entirely insensitive to a *surplus* of competing false ones. If extraction relied on simple statistical frequency, selecting whichever candidate appears most often, then injection at 2× or 3× frequency should systematically redirect the model to the noise token. The observed immunity falsifies this account. Instead, the non-uniform informational content of the reasoning chain, the numeric computations, relational equalities, and symbolic derivations identified in Findings 5 and 6, forms a structurally coherent scaffold with strong intrinsic signal: these relational patterns collectively constrain the model to the correct answer and actively suppress spurious candidates, even when false answers numerically dominate the context by frequency.
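To make the falsified account concrete: a hypothetical extractor that simply returns the most frequent numeric candidate (our own baseline, not the paper's method) would be redirected to the injected value, whereas the models are not:

```python
import re
from collections import Counter

def frequency_extract(chain: str) -> str:
    """Naive baseline: pick the numeric string that occurs most often."""
    candidates = re.findall(r"\d+", chain)
    return Counter(candidates).most_common(1)[0][0]

# True answer 70 appears 3 times; "123" is injected at 3x that frequency.
chain = "Thus answer 70. " * 3 + "Thus answer: 123. " * 9
print(frequency_extract(chain))  # 123 -- frequency selection follows the noise
```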
### 4.5 Theoretical Conjecture
The pattern is consistent with transformers processing reasoning chains as approximately unordered sets of symbolic constraints, attending selectively to numeric values and relational structures while treating prose and positional ordering as low-information context. Because self-attention is architecturally near-permutation-equivariant up to positional encoding, we conjecture that *relational* rather than *positional* structure governs extraction, a property potentially rooted in pretraining (Finding 4) and robust to adversarial noise (Finding 7). This robustness may arise because the relational scaffold suppresses spurious candidates through implicit constraint satisfaction over the numeric subspace. However, this interpretation remains speculative and requires further experimental validation, which we leave to future work.
## 5 Conclusion
Modern reasoning language models generate dense, sequential chain-of-thought traces under two implicit but untested assumptions: that every token contributes (*density*) and that steps must be consumed in order (*order*). We asked what information the answer-*extraction* stage actually relies on, and answered empirically through a systematic intervention pipeline of removal, masking, shuffling, and noise injection.
Our seven findings overturn both assumptions along three converging dimensions. *Order*: line-level shuffling reduces accuracy by less than 0.5 pp, word-level shuffling still retains 62%–89%, and pretrained-only and instruction-tuned variants tolerate shuffling near-identically, indicating that order-independence originates from pretraining rather than reasoning-specific fine-tuning. *Dense*: masking alphabetic prose *improves* accuracy by 4.7 pp, the answer is a strong anchor (an 18.0 pp drop when masked alone), and masking digits collapses accuracy to 0%. *Robustness*: a chain stripped of all natural language and shuffled into arbitrary line order still yields 83%, and false-answer injection at 3× frequency leaves accuracy unchanged, falsifying a frequency-based extraction account. Together, these results establish that answer extraction operates on a sparse, order-shuffling, and structurally robust informational substrate, opening concrete paths toward parallelized and token-efficient reasoning.
Limitations. This work probes only answer *extraction*; whether generation can be made commensurately sparse or parallel is an open engineering question we motivate but do not resolve. Sparsity findings rest primarily on AIME 2025, whose numeric-essential structure is inherent to mathematical notation and may differ for code or scientific reasoning. Order-independence holds strictly above the word level, and word-shuffle tolerance varies non-trivially across models. Our theoretical conjecture is consistent with the data; however, it lacks mechanistic verification and requires further experimental validation.
Impact Statement. The primary benefit is more token-efficient and parallelizable inference, lowering deployment cost. We foresee no direct harmful applications of this research.
Table 7: Example reasoning chains for the baseline and the five transformations in Table [5](https://arxiv.org/html/2605.07307#S4.T5), applied to the question: Find the sum of all integer bases b > 9 for which 17_b is a divisor of 97_b. Correct answer highlighted in red.
Baseline
We are asked: Find the sum of all integer bases b > 9 for which 17_b divides 97_b.
Interpretation: In base b, the numeral "17" means the integer 1*b + 7 = b + 7. Similarly, "97" in base b means 9*b + 7 = 9b + 7. So the condition is that (b + 7) divides (9b + 7). For b > 9 (since digits must be less than base, digits used are 1,7,9,7. So base must be > 9 to allow digit "9").
We need integer bases b > 9 such that (b + 7) | (9b + 7). Find all b > 9 integer solutions.
So we solve (9b + 7) mod (b + 7) = 0.
Compute remainder: 9b + 7 = 9(b + 7) - 9*7 + 7 = 9(b + 7) - 63 + 7 = 9(b + 7) - 56.
Thus remainder when dividing by b + 7 is -56 (or equivalently b+7 - 56 mod). So we want b+7 divides 56. Since remainder is -56, we need b+7 divides 56. But note that remainder of division is 9b+7 - 9(b+7) = -56. So 9b+7 = 9(b+7) - 56. If b+7 divides 9(b+7) obviously, then b+7 will also divide the remainder -56. So condition: b+7 divides 56. But also b+7 > 0.
Thus b+7 is a positive divisor of 56.
Since b > 9 => b+7 > 16. So positive divisors of 56 greater than 16.
Divisors of 56: 1,2,4,7,8,14,28,56.
Those > 16: 28,56.
Thus b+7 can be 28 or 56.
So b = 21 or 49. Both > 9, integer.
Check base digits constraints: need digits 9,7 allowed. Base must be > 9, which both satisfy.
Thus the required sum is 21+49 = 70.
Wait check also b+7 could be negative? No, base is positive integer > 9.
Thus answer 70.
But perhaps also b+7 dividing -56 (which is same). But we also need to ensure that b+7 indeed divides 9b+7 (i.e., remainder zero). Our condition implies b+7 divides 56. Since -56 remainder is multiple of divisor; but careful: remainder must be 0, i.e., -56 should be multiple of b+7. So b+7 is a divisor of 56. Should -56 be exactly divisible: b+7 | 56. So that's the condition.
Hence solutions as above.
Double-check those bases indeed work:
b=21: 17_21 = 21+7 = 28; 97_21 = 9*21+7 = 189+7 = 196. 196/28 = 7 -> integer.
b=49: 17_49 = 49+7 = 56; 97_49 = 9*49+7 = 441+7 = 448. 448/56 = 8 -> integer.
Thus sum = 70.
Thus answer is 70.
But could there be any other possibilities? Let's check for b+7 dividing 56 also could be 1? But b+7 must be > 16 as we said. So only 28,56. Verified.
Thus answer 70.
However, is there any catch that digits must be less than base, but base 21 and 49 satisfy digits max 9 <= base-1.
Thus answer stands.
Thus solution: 70.
In final answer format: 70.
Will provide reasoning.
Proceed to final.
ℛα\\mathcal\{R\}\_\{\\alpha\}
:\>\>9 17\_ 97\_ \. : , "17" 1\* \+ 7 = \+ 7\. , "97" 9\* \+ 7 = 9 \+ 7\. \( \+ 7\) \(9 \+ 7\)\.\>\>9 \( , 1,7,9,7\.\>\>9 "9"\)\.\>\>9 \( \+ 7\)\|\|\(9 \+ 7\)\.\>\>9 \. \(9 \+ 7\) \( \+ 7\) = 0\. : 9 \+ 7 = 9\( \+ 7\) \- 9\*7 \+ 7 = 9\( \+ 7\) \- 63 \+ 7 = 9\( \+ 7\) \- 56\. \+ 7 \-56 \( \+7 \- 56 \)\. \+7 56\. \-56, \+7 56\. 9 \+7 \- 9\( \+7\) = \-56\. 9 \+7 = 9\( \+7\) \- 56\. \+7 9\( \+7\) , \+7 \-56\. : \+7 56\. \+7\>\>0\. \+7 56\.\>\>9 =\>\>\+7\>\>16\. 56 16\. 56: 1,2,4,7,8,14,28,56\.\>\>16: 28,56\. \+7 28 56\. = 21 49\.\>\>9, \. : 9,7 \.\>\>9, \. 21\+49 =70\. \+7 ?\>\>9\.70\. \+7 \-56 \( \)\. \+7 9 \+7 \( \. \., \)\. \+7 56\. \-56 ; : 0, \. \., \-56 \+7\. \+7 56\. \-56 : \+7\|\|56\. ’ \. \. \- : =21: 17\_21 = 21\+7=28; 97\_21 = 9\*21\+7 = 189\+7 = 196\. 196/28 = 7 \-\>\>\. =49: 17\_49 = 49\+7 = 56; 97\_49 = 9\*49\+7 = 441\+7=448\. 448/56 =8 \-\>\>\. =70\.70\. ? ’ \+7 56 1? \+7\>\>16 \. 28,56\. \.70\. , , 21 49 9<=<=\-1\. \. :70\. :70\. \. \.
ℛ_α ∘ ℛ_ans
:\>\>9 17\_ 97\_ \. : , "17" 1\* \+ 7 = \+ 7\. , "97" 9\* \+ 7 = 9 \+ 7\. \( \+ 7\) \(9 \+ 7\)\.\>\>9 \( , 1,7,9,7\.\>\>9 "9"\)\.\>\>9 \( \+ 7\)\|\|\(9 \+ 7\)\.\>\>9 \. \(9 \+ 7\) \( \+ 7\) = 0\. : 9 \+ 7 = 9\( \+ 7\) \- 9\*7 \+ 7 = 9\( \+ 7\) \- 63 \+ 7 = 9\( \+ 7\) \- 56\. \+ 7 \-56 \( \+7 \- 56 \)\. \+7 56\. \-56, \+7 56\. 9 \+7 \- 9\( \+7\) = \-56\. 9 \+7 = 9\( \+7\) \- 56\. \+7 9\( \+7\) , \+7 \-56\. : \+7 56\. \+7\>\>0\. \+7 56\.\>\>9 =\>\>\+7\>\>16\. 56 16\. 56: 1,2,4,7,8,14,28,56\.\>\>16: 28,56\. \+7 28 56\. = 21 49\.\>\>9, \. : 9,7 \.\>\>9, \. 21\+49 = \. \+7 ?\>\>9\. \. \+7 \-56 \( \)\. \+7 9 \+7 \( \. \., \)\. \+7 56\. \-56 ; : 0, \. \., \-56 \+7\. \+7 56\. \-56 : \+7\|\|56\. ’ \. \. \- : =21: 17\_21 = 21\+7=28; 97\_21 = 9\*21\+7 = 189\+7 = 196\. 196/28 = 7 \-\>\>\. =49: 17\_49 = 49\+7 = 56; 97\_49 = 9\*49\+7 = 441\+7=448\. 448/56 =8 \-\>\>\. = \. \. ? ’ \+7 56 1? \+7\>\>16 \. 28,56\. \. \. , , 21 49 9<=<=\-1\. \. : \. : \. \. \.
𝒮_line ∘ ℛ_α
=21: 17\_21 = 21\+7=28; 97\_21 = 9\*21\+7 = 189\+7 = 196\. 196/28 = 7 \-\>\>\. \+7 ?\>\>9\. \+7 28 56\.70\. :70\. \+7 56\. \+ 7 \-56 \( \+7 \- 56 \)\. \+7 56\. \-56, \+7 56\. 9 \+7 \- 9\( \+7\) = \-56\. 9 \+7 = 9\( \+7\) \- 56\. \+7 9\( \+7\) , \+7 \-56\. : \+7 56\. \+7\>\>0\. : 9,7 \.\>\>9, \. = 21 49\.\>\>9, \.70\.\>\>16: 28,56\. \. \.70\. \+7 \-56 \( \)\. \+7 9 \+7 \( \. \., \)\. \+7 56\. \-56 ; : 0, \. \., \-56 \+7\. \+7 56\. \-56 : \+7\|\|56\. ’ \. :70\. : , "17" 1\* \+ 7 = \+ 7\. , "97" 9\* \+ 7 = 9 \+ 7\. \( \+ 7\) \(9 \+ 7\)\.\>\>9 \( , 1,7,9,7\.\>\>9 "9"\)\. 21\+49 =70\. \- :\>\>9 \( \+ 7\)\|\|\(9 \+ 7\)\.\>\>9 \. \. =70\. \. : 9 \+ 7 = 9\( \+ 7\) \- 9\*7 \+ 7 = 9\( \+ 7\) \- 63 \+ 7 = 9\( \+ 7\) \- 56\. , , 21 49 9<=<=\-1\.\>\>9 =\>\>\+7\>\>16\. 56 16\. 56: 1,2,4,7,8,14,28,56\. ? ’ \+7 56 1? \+7\>\>16 \. 28,56\. \. :\>\>9 17\_ 97\_ \. \(9 \+ 7\) \( \+ 7\) = 0\. =49: 17\_49 = 49\+7 = 56; 97\_49 = 9\*49\+7 = 441\+7=448\. 448/56 =8 \-\>\>\.
𝒮_line ∘ ℛ_α ∘ ℛ_ans
=21: 17\_21 = 21\+7=28; 97\_21 = 9\*21\+7 = 189\+7 = 196\. 196/28 = 7 \-\>\>\. \+7 ?\>\>9\. \+7 28 56\. \. : \. \+7 56\. \+ 7 \-56 \( \+7 \- 56 \)\. \+7 56\. \-56, \+7 56\. 9 \+7 \- 9\( \+7\) = \-56\. 9 \+7 = 9\( \+7\) \- 56\. \+7 9\( \+7\) , \+7 \-56\. : \+7 56\. \+7\>\>0\. : 9,7 \.\>\>9, \. = 21 49\.\>\>9, \. \.\>\>16: 28,56\. \. \. \. \+7 \-56 \( \)\. \+7 9 \+7 \( \. \., \)\. \+7 56\. \-56 ; : 0, \. \., \-56 \+7\. \+7 56\. \-56 : \+7\|\|56\. ’ \. : \. : , "17" 1\* \+ 7 = \+ 7\. , "97" 9\* \+ 7 = 9 \+ 7\. \( \+ 7\) \(9 \+ 7\)\.\>\>9 \( , 1,7,9,7\.\>\>9 "9"\)\. 21\+49 = \. \- :\>\>9 \( \+ 7\)\|\|\(9 \+ 7\)\.\>\>9 \. \. = \. \. : 9 \+ 7 = 9\( \+ 7\) \- 9\*7 \+ 7 = 9\( \+ 7\) \- 63 \+ 7 = 9\( \+ 7\) \- 56\. , , 21 49 9<=<=\-1\.\>\>9 =\>\>\+7\>\>16\. 56 16\. 56: 1,2,4,7,8,14,28,56\. ? ’ \+7 56 1? \+7\>\>16 \. 28,56\. \. :\>\>9 17\_ 97\_ \. \(9 \+ 7\) \( \+ 7\) = 0\. =49: 17\_49 = 49\+7 = 56; 97\_49 = 9\*49\+7 = 441\+7=448\. 448/56 =8 \-\>\>\.
𝒮_word ∘ ℛ_α
\+ \- 9 \( = 49\+7 \)\. ? 448/56 9 7\)\. 1,2,4,7,8,14,28,56\. \+7 \-56 \-56\.\|\|9\* 7\)\. ,\>\>56: 196\. 0\. 7 1,7,9,7\. 17\_21 = \( 56 =21: 9\*21\+7 = \+ 7\) \- \+7 \)\. \-56 : =\>\>16\. \+7 7\) \+7 \(\>\>\( \. \., \+ 56\. \. \., , 17\_49 1? = 189\+7 \. \- = : \+ \+ : : \+ 0, 16: 56\. 56\. 16 9\( \+7\) = = \-56 \. \- 21 \+7 21\+7=28; 7 97\_ \.\>\>9\*7 \+7 7 ; 9\( \+7\) 7\) 56\. \+7 56\. = 56\. \+\|\|’\>\>9,70\. \+ =70\. 49\.70\. \+7 7\) =70\.70\.<=<=97\_49 =8 9\>\>\. 56\. 97\_21\>\>,\>\>56\. \+7\. = \)\. 7 \. : 7\.\>\>9, 7 : \-\>\>\( =\>\>49 \+7 56\. , "9"\)\. \. \(\>\>770\. \+ \+ 9 \( 28,56\. 9 \+7 196/28 \+ 9 9\*49\+7 9 28,56\. =49: \+ 7\) : 56 = 56\. \+770\. \. "17" \+ \+7 = , 9\( 21 9\( : 16\. \+7 28 ’ \+7 \+ \- 56 \-1\. 7\) 1\* \+7 \. 63 441\+7=448\. 56; 9,7 = \-56 "97" \-56 7\) : \. \-56\. \- 9 \+7 \+7 ? = \+ , \(9 9 \+7 9\( \+7\) 9 \. \. \. \+ \+7 \- 9 \-56, =\>\>\-\>\>= 9\(\>\>\+ \+7 : 21\+49 0\. \(9 7\. 17\_ \+7\>\>9\. = \. \. = 7 9 \(9