More Thinking, More Bias: Length-Driven Position Bias in Reasoning Models

arXiv cs.AI Papers

Summary

This research paper investigates position bias in reasoning models, finding that bias scales with the length of the reasoning trajectory rather than being eliminated by 'more thinking.' The study provides causal evidence and a diagnostic toolkit for auditing this length-driven bias in multiple-choice QA evaluations.

arXiv:2605.06672v1 Announce Type: new Abstract: Chain-of-thought (CoT) reasoning and reasoning-tuned models such as DeepSeek-R1 are commonly assumed to reduce shallow heuristic biases by thinking carefully. We test this on position bias in multiple-choice QA and find a different story: within any reasoning-capable model, per-question position bias scales with the length of the reasoning trajectory. Across thirteen reasoning-mode configurations (two R1-distilled 7-8B models, two base models prompted with CoT, and DeepSeek-R1 at 671B) on MMLU, ARC-Challenge, and GPQA, twelve show a positive partial correlation between trajectory length and Position Bias Score (PBS) after controlling for accuracy, ranging from 0.11 to 0.41 (all p < 0.05). All twelve open-weight reasoning-mode configurations show monotonically increasing PBS across length quartiles. A truncation intervention provides causal evidence: continuations resumed from later points in the trajectory are increasingly likely to shift toward position-preferred options (16% to 32% for R1-Qwen-7B across absolute-position buckets). At 671B, aggregate PBS collapses to 0.019, but the length effect still manifests in the longest quartile (PBS = 0.071), suggesting that accuracy gates the expression of length-driven bias rather than eliminating the underlying mechanism. We additionally find that direct-answer position bias is a distinct phenomenon with a different footprint (strong in Llama-Instruct-direct, weak in Qwen-Instruct-direct, and uncorrelated with trajectory length): CoT reasoning replaces this baseline bias with length-accumulated bias. Our results argue that reasoning-capable models should not be treated as order-robust by default in MCQ evaluation pipelines, and offer a diagnostic toolkit (PBS, commitment change point, effective switching, truncation probes) for auditing position bias in reasoning models.
Original Article
View Cached Full Text

Cached at: 05/11/26, 07:04 AM

# Length-Driven Position Bias in Reasoning Models
Source: [https://arxiv.org/html/2605.06672](https://arxiv.org/html/2605.06672)
## More Thinking, More Bias: Length\-Driven Position Bias in Reasoning Models

###### Abstract

Chain\-of\-thought \(CoT\) reasoning and reasoning\-tuned models such as DeepSeek\-R1 are commonly assumed to reduce shallow, heuristic biases by thinking carefully\. We test this on*position bias*in multiple\-choice QA and find a different story: within any reasoning\-capable model, per\-question position bias scales with the length of the reasoning trajectory\. Across thirteen reasoning\-mode configurations \(two R1\-distilled 7–8B models, two base models prompted with CoT, and DeepSeek\-R1 at 671B\) on MMLU, ARC\-Challenge, and GPQA, twelve show a positive partial correlationρ​\(length,PBS∣accuracy\)\\rho\(\\text\{length\},\\textsc\{PBS\}\\mid\\text\{accuracy\}\), ranging from 0\.11 to 0\.41 \(allp<0\.05p<0\.05\)\. All twelve open\-weight reasoning\-mode configurations show monotonically increasingPBSacross length quartiles; a truncation intervention provides causal evidence that continuations from later points in the trajectory are increasingly likely to shift toward position\-preferred options \(16%→\\rightarrow32% for R1\-Qwen\-7B\)\. At 671B, aggregatePBScollapses to 0\.019, but the length effect still manifests in the longest quartile \(PBS= 0\.071\), suggesting accuracy gates the*expression*of length\-driven bias rather than eliminating the underlying mechanism\. We additionally find that*direct\-answer*position bias is a distinct phenomenon with a different footprint \(strong in Llama\-Instruct\-direct, weak in Qwen\-Instruct\-direct, and uncorrelated with trajectory length\): CoT reasoning replaces this*baseline bias*with*length\-accumulated bias*\. Our results argue that reasoning\-capable models should not be treated as order\-robust by default in MCQ evaluation pipelines, and offer a diagnostic toolkit \(PBS,CCP, effective switching, truncation probes\) for auditing position bias in reasoning models\.

## 1Introduction

Reasoning\-tuned language models — OpenAI’s o\-series, DeepSeek\-R1\(DeepSeek\-AI et al\.,[2025](https://arxiv.org/html/2605.06672#bib.bib2)\), Qwen’s QwQ family\(Qwen Team,[2024](https://arxiv.org/html/2605.06672#bib.bib6)\), and their distilled derivatives — are routinely promoted as models that “think longer to answer better\.” A natural corollary of this narrative is that extended thinking should also*reduce*shallow, heuristic biases\.Position biasin multiple\-choice QA is a canonical such heuristic: an unbiased model’s answer distribution should be invariant to the ordering of answer choices, yet prior work has repeatedly documented that LLMs disproportionately select options at particular positions\(Zheng et al\.,[2023](https://arxiv.org/html/2605.06672#bib.bib10); Pezeshkpour and Hruschka,[2024](https://arxiv.org/html/2605.06672#bib.bib5); Wang et al\.,[2024](https://arxiv.org/html/2605.06672#bib.bib8)\)\. If “more thinking” means “less shortcut\-taking,” we would expect reasoning models to be*less*position\-biased than their non\-reasoning counterparts\.

We find that the relationship is more nuanced, and hinges on trajectory length\.Across matched pairs of reasoning\-tuned and Instruct base models on MMLU, ARC\-Challenge, and GPQA, two phenomena stand out\.

- •Within reasoning trajectories, position bias scales with length\.Per\-question Position Bias Score \(PBS\) correlates positively with mean trajectory length after controlling for accuracy, in 12 of 13 reasoning\-mode configurations tested \(ρ=0\.11\\rho=0\.11–0\.410\.41, allp<0\.05p<0\.05\)\. Binning questions into length quartiles yields a*monotonically increasing*PBSfrom shortest to longest quartile in all 12 open\-weight reasoning\-mode configurations\. A truncation intervention confirms the effect is causal: continuations resumed from later points in the trajectory are increasingly likely to shift toward position\-preferred options \(from 16% to 32% for R1\-Qwen\-7B across absolute\-position buckets\)\.
- •Direct\-answer bias is a separate phenomenon\.In the Llama pair, Instruct\-direct exhibits extreme baseline position bias \(PBS= 0\.40/0\.26/0\.61 on MMLU/ARC/GPQA\) that is essentially uncorrelated with trajectory length\. CoT reasoning*replaces*this baseline bias with length\-accumulated bias: for Llama, CoT reducesPBS; for Qwen, whose Instruct\-direct baseline is already mild, CoT does not reduce but subsequent length extension \(R1\) increases it\. Our length\-driven claim therefore concerns reasoning trajectories specifically, not position bias in general\.

To validate the length mechanism across scales, we evaluate DeepSeek\-R1 at 671B parameters on MMLU\. AggregatePBSdrops to 0\.019 \(from 0\.21 at 7–8B\), but the length\-quartile pattern persists:PBSis essentially zero on the first three quartiles \(short\- and medium\-length trajectories\) and 0\.071 on the longest quartile\. The commitment\-timing signature \(CCP\) is essentially invariant to scale \(0\.73 vs\. 0\.75\)\. We interpret this as evidence that*accuracy gates the expression of length\-driven bias*rather than eliminating the underlying mechanism\.

#### Contributions\.

- C1\.Within reasoning trajectories, per\-questionPBSscales with length, controlling for accuracy\. This holds across R1\-distilled, Instruct\-CoT, and API\-scale reasoning models on three MCQ benchmarks\.
- C2\.A truncation intervention provides causal evidence: later truncations of a trajectory produce more position\-preferred answer shifts, in a monotonic*accumulated\-exposure*pattern\.
- C3\.A cross\-scale validation at 671B shows that the length\-driven mechanism persists at scale, while aggregatePBSis modulated by accuracy\.
- C4\.A*two\-source*characterization of position bias: baseline bias \(direct mode, base\-model\-specific, length\-independent\) is distinct from length\-driven bias \(reasoning mode, universal, length\-dependent\); CoT reasoning replaces the former with the latter\.
- C5\.A diagnostic toolkit \(PBS,CCP, effective switching, truncation probes\) for auditing position bias in reasoning models, with code and data released\.

#### Why this matters\.

Reasoning\-capable models are increasingly deployed as judges, graders, and decision\-support systems where order\-robustness is a silent requirement\. Our results argue that extending reasoning length is not a free lunch for bias: practitioners should not assume that longer CoT outputs are*more*order\-invariant than shorter ones\.

## 2Related Work

#### Position bias in LLM evaluation\.

Position bias has been documented across scales, training regimes, and prompt formats\(Zheng et al\.,[2023](https://arxiv.org/html/2605.06672#bib.bib10); Wang et al\.,[2024](https://arxiv.org/html/2605.06672#bib.bib8); Pezeshkpour and Hruschka,[2024](https://arxiv.org/html/2605.06672#bib.bib5)\)\. Most proposed mitigations treat the bias as a uniform property of the model \(e\.g\., via option\-permutation averaging\) rather than a trajectory\-dependent phenomenon\.

#### Reasoning\-tuned language models\.

DeepSeek\-R1\(DeepSeek\-AI et al\.,[2025](https://arxiv.org/html/2605.06672#bib.bib2)\), the o\-series, and QwQ\(Qwen Team,[2024](https://arxiv.org/html/2605.06672#bib.bib6)\)are trained to produce extended internal reasoning before a final answer\. Distilled variants \(R1\-Distill\-Qwen, R1\-Distill\-Llama\)\(DeepSeek\-AI et al\.,[2025](https://arxiv.org/html/2605.06672#bib.bib2)\)transfer this behavior to smaller base models\. Few works have audited how reasoning\-style training alters the*bias profile*inherited from the base model\.

#### Bias amplification in chain of thought\.

Wu et al\. \([2025](https://arxiv.org/html/2605.06672#bib.bib9)\)show that social bias intensifies across reasoning steps on BBQ: a biased step early in the chain tends to be sustained and amplified rather than corrected\.Luo et al\. \([2025](https://arxiv.org/html/2605.06672#bib.bib4)\)introduce*social bias aggregation*, documenting a similar step\-wise drift and proposing prompt\-based mitigations\.

#### Our position\.

We share withWu et al\. \([2025](https://arxiv.org/html/2605.06672#bib.bib9)\)the motif that bias is not fixed at the output layer but accumulates along the reasoning trajectory\. We differ in three concrete ways\. First, we target*position*bias, a structural property of the prompt format, rather than social bias grounded in the question content\. Second, we use*reasoning length as a continuous predictor*of bias magnitude across 15 configurations and report partial correlation coefficients, rather than only per\-step drift\. Third, our*truncation intervention*directly manipulates the length of exposure rather than merely observing it, providing the cleanest evidence for an accumulated\-exposure mechanism\. Finally, we distinguish length\-driven bias in reasoning mode from a separate*baseline*bias in direct mode, a distinction absent from prior work\.

## 3Method

### 3\.1Matched\-Pair Evaluation Protocol

We construct matched model pairs in which both members share a base model family, isolating reasoning\-style training from base\-model identity:

- •Qwen pair:DeepSeek\-R1\-Distill\-Qwen\-7B↔\\leftrightarrowQwen2\.5\-7B\-Instruct
- •Llama pair:DeepSeek\-R1\-Distill\-Llama\-8B↔\\leftrightarrowLlama\-3\.1\-8B\-Instruct
- •Scale anchor:DeepSeek\-R1 \(671B, via official API\) on MMLU only

For each Instruct model we evaluate both a*direct*mode \(answer only\) and a*CoT*mode \(“let’s think step by step”\), allowing us to separate reasoning*style*from reasoning\-tuned*weights*\.

### 3\.2Permutation Protocol

For each question, we construct four variants by cyclically shifting the answer\-option labels\. If the original ordering is\(A,B,C,D\)\(A,B,C,D\)with correct answer at positionkk, permutations∈\{0,1,2,3\}s\\in\\\{0,1,2,3\\\}places the correct answer at position\(k\+s\)mod4\(k\+s\)\\bmod 4\. Each variant is queried independently\.

### 3\.3Metrics

#### Position Bias Score \(PBS\)\.

Let𝐩¯q∈Δ4\\bar\{\\mathbf\{p\}\}\_\{q\}\\in\\Delta^\{4\}be the mean empirical distribution of the model’s answer over the four permutations of questionqq, aggregated by*absolute answer position*\. Define

PBS​\(q\)=‖𝐩¯q−𝐮‖2,\\textsc\{PBS\}\(q\)\\;=\\;\\\|\\bar\{\\mathbf\{p\}\}\_\{q\}\-\\mathbf\{u\}\\\|\_\{2\},\(1\)where𝐮=\(14,14,14,14\)\\mathbf\{u\}=\(\\tfrac\{1\}\{4\},\\tfrac\{1\}\{4\},\\tfrac\{1\}\{4\},\\tfrac\{1\}\{4\}\)\.

#### Commitment Change Point \(CCP\)\.

The normalized prefix fraction at which the model’s extracted answer first matches, and thereafter remains, the full\-trajectory answer:

CCP=1T​min⁡\{t:a​\(t′\)=a​\(T\)​∀t′≥t\}\.\\textsc\{CCP\}\\;=\\;\\tfrac\{1\}\{T\}\\min\\\!\\bigl\\\{t:a\(t^\{\\prime\}\)=a\(T\)\\ \\forall\\,t^\{\\prime\}\\geq t\\bigr\\\}\.\(2\)

#### Effective Switching \(Eff\-Sw\)\.

Number of answer\-changes along the trajectory normalized by trajectory length, to make comparison across model verbosities meaningful\.

### 3\.4Truncation Intervention

For each question with a detectableCCP, we truncate the trajectory at offsets\{−0\.15,−0\.05,\+0\.05,\+0\.15\}\\\{\-0\.15,\-0\.05,\+0\.05,\+0\.15\\\}relative toCCPand resume generation three times independently from each truncation point\. We record \(a\) whether the final answer*changes*relative to the original, and \(b\) toward which absolute position it changes\.

### 3\.5Datasets

MMLU\(Hendrycks et al\.,[2021](https://arxiv.org/html/2605.06672#bib.bib3)\)\(1000 questions, 200 for API anchor\), ARC\-Challenge\(Clark et al\.,[2018](https://arxiv.org/html/2605.06672#bib.bib1)\)\(496 questions\), and GPQA\(Rein et al\.,[2024](https://arxiv.org/html/2605.06672#bib.bib7)\)\(198 questions\)\. All items are 4\-option MCQ after filtering\.

## 4Experimental Setup

Local models are served viallama\-cpp\-pythonwith Q4\_K\_M quantization on a single NVIDIA A100\-80G\. Greedy decoding for the main experiment; nucleus sampling \(p=0\.95,T=0\.7p=0\.95,T=0\.7\) for truncation continuations\. DeepSeek\-R1 is accessed via the official API and returnsreasoning\_content \+ content, concatenated as the full trajectory\. Extraction uses a regex cascade with a letter\-frequency fallback; extraction rate exceeds 99% in all reasoning\-mode configurations\.

## 5Results

### 5\.1Length\-quartile PBS: a cross\-scale view

We open with the central empirical pattern of the paper\. For each reasoning\-mode model–benchmark combination, we bin questions into four length quartiles \(Q1 = shortest 25% of trajectories, Q4 = longest 25%\) and compute meanPBSper quartile\. Figure[1](https://arxiv.org/html/2605.06672#S5.F1)plots the result for R1\-Qwen\-7B, R1\-Llama\-8B, and DeepSeek\-R1 \(671B, MMLU only\)\.

![Refer to caption](https://arxiv.org/html/2605.06672v1/x1.png)Figure 1:Length\-quartilePBSis monotonic across scales\.R1\-Qwen\-7B and R1\-Llama\-8B showPBSgrowing33–4×4\\timesfrom shortest to longest length quartile on MMLU; similar patterns hold on ARC and GPQA\. At 671B \(MMLU, green\),PBSis essentially zero on the first three quartiles and0\.0710\.071on the longest, showing that the length\-driven mechanism persists at scale but is gated by question difficulty\.For the two open\-weight R1\-distilled models, the effect is sharp: on MMLU,PBSrises from0\.107→0\.151→0\.213→0\.3850\.107\\rightarrow 0\.151\\rightarrow 0\.213\\rightarrow 0\.385for R1\-Qwen\-7B \(3\.6×3\.6\\times\), and from0\.091→0\.187→0\.235→0\.3580\.091\\rightarrow 0\.187\\rightarrow 0\.235\\rightarrow 0\.358for R1\-Llama\-8B \(3\.9×3\.9\\times\)\. Monotonicity holds in 12 of 12 open\-weight \(model×\\timesbenchmark\) combinations we tested\.

#### Cross\-scale anchor\.

At 671B parameters, aggregate MMLUPBSdrops to0\.0190\.019and accuracy rises to 89\.8%\. The first three length quartiles havePBSof0\.000/0\.000/0\.0070\.000/0\.000/0\.007respectively; the longest quartile hasPBS=0\.0710\.071\. We interpret this as evidence that at scale, the correct\-answer signal is strong enough to dominate the accumulated positional pull on short\-to\-medium trajectories, but the length\-driven mechanism still expresses itself on the hardest, longest questions\. The commitment\-timing signature is essentially scale\-invariant:CCP=0\.73\\textsc\{CCP\}=0\.73at 671B vs\.0\.750\.75at 7–8B, suggesting thatCCPindexes a structural property of reasoning\-tuned models that does not scale away\.

### 5\.2Per\-model partial correlations

We next ask whether the quartile\-level pattern is consistent with a continuous, within\-model relationship between length andPBS\. Table[1](https://arxiv.org/html/2605.06672#S5.T1)reports the partial correlationρ​\(length,PBS∣accuracy\)\\rho\(\\mathrm\{length\},\\textsc\{PBS\}\\mid\\mathrm\{accuracy\}\)across all reasoning\-mode and direct\-mode configurations in our experiments\.

Table 1:Partial correlation between per\-question mean trajectory length andPBS, controlling for accuracy\.∗\{\\ast\}:p<0\.05p<0\.05;∗⁣∗\{\\ast\\ast\}:p<0\.01p<0\.01;∗⁣∗⁣∗\{\\ast\\ast\\ast\}:p<10−3p<10^\{\-3\}\. Length predictsPBSconsistently in reasoning mode \(12 of 13 configurations significant atp<0\.05p<0\.05\); the effect is weak or absent in direct mode \(§[5\.4](https://arxiv.org/html/2605.06672#S5.SS4)\)\.12 of 13 reasoning\-mode configurations exhibit a significantly positive partial correlation \(ρ\\rhobetween0\.110\.11and0\.410\.41,p<0\.05p<0\.05\); the single non\-significant exception \(Qwen\-Instruct\-CoT on GPQA,n=198n=198\) is directionally consistent but underpowered\. Figure[2](https://arxiv.org/html/2605.06672#S5.F2)visualizes the relationship per\-configuration for the four local R1\-distilled panels\. Direct\-mode coefficients are substantially weaker, and essentially zero for Llama\-Instruct\-direct; we return to this in §[5\.4](https://arxiv.org/html/2605.06672#S5.SS4)\.

![Refer to caption](https://arxiv.org/html/2605.06672v1/x2.png)Figure 2:Per\-configuration view of the length–PBSrelationship for the four local R1\-distilled panels\. Points are quartile means with standard\-error bars; dashed lines are linear fits on the quartile centers\. Partialρ\\rho\(controlling for accuracy\) is annotated in each panel\.
### 5\.3Truncation intervention: causal evidence

![Refer to caption](https://arxiv.org/html/2605.06672v1/x3.png)Figure 3:Truncation intervention on MMLU for R1\-Qwen\-7B and Qwen\-Instruct\-CoT\.\(a\)Answer change rate by absolute position in the trajectory\.\(b\)Among changed answers, the fraction that shifts toward position A \(the position\-preferred option in our dataset\)\. Both models show a monotonic increase in directional shift with truncation position, consistent with an accumulated\-exposure mechanism\. One Qwen\-Instruct\-CoT bucket \(n=32n=32, directional shift 0%\) is excluded from panel \(b\) as underpowered\.The above analyses are observational\. To establish causality, we run the truncation intervention \(§[3](https://arxiv.org/html/2605.06672#S3)\) on R1\-Qwen\-7B and Qwen\-Instruct\-CoT over MMLU \(200 questions×\\times4 truncation offsets×\\times3 continuations\)\.

For R1\-Qwen\-7B, the directional shift toward position A \(the position\-preferred option in our dataset\) increases monotonically from 16% at trajectory positions 0–0\.3 to 32% at positions 0\.9–1\.0\. Qwen\-Instruct\-CoT shows a qualitatively similar gradient \(21%→\\rightarrow28%\), though with lower magnitude and one noisy bucket\. This demonstrates that*accumulated exposure is a property of the CoT reasoning process itself, not of reasoning\-tuned weights alone*\.

The before\-CCPvs\. after\-CCPchange rate ordering diverges between the two models: R1\-Qwen\-7B shows a decrease \(35%→\\rightarrow22%,χ2​p<10−13\\chi^\{2\}\\ p<10^\{\-13\}\), consistent withCCPmarking a sharp commitment boundary; Qwen\-Instruct\-CoT shows an*increase*\(26%→\\rightarrow45%,χ2​p<10−13\\chi^\{2\}\\ p<10^\{\-13\}\), suggesting that in Instruct\-CoT the post\-CCPportion of the trajectory is dominated by formatting rather than reasoning, so commitment is more fragile under re\-sampling\.

### 5\.4Two sources of position bias

The length\-driven mechanism does not explain all position bias we observe\. Figure[4](https://arxiv.org/html/2605.06672#S5.F4)plotsPBSagainst mean trajectory length for all six \(model, mode\) configurations in each of the three benchmarks\.

![Refer to caption](https://arxiv.org/html/2605.06672v1/x4.png)Figure 4:PBSvs mean trajectory length \(log scale\) for all six \(model, mode\) configurations per benchmark\. Direct\-mode points \(grey\) form a separate cluster with family\-specific baseline bias\. Reasoning\-mode points \(blue / red\) fall along an upward length–PBStrajectory \(light red trend line\)\. Shading demarcates the two regimes\.Two observations stand out\.

\(i\) Direct\-mode position bias is a separate phenomenon\.Llama\-Instruct\-direct exhibits severe baseline bias \(PBS=0\.40/0\.26/0\.610\.40/0\.26/0\.61on MMLU/ARC/GPQA\), far exceeding any reasoning\-mode configuration\. Qwen\-Instruct\-direct, in contrast, has mild baseline bias \(0\.18/0\.05/0\.330\.18/0\.05/0\.33\)\. Direct\-mode partial correlation with trajectory length is near zero for Llama and weak for Qwen \(Table[1](https://arxiv.org/html/2605.06672#S5.T1)\); direct\-mode “trajectory length” is dominated by a handful of tokens anyway\. We interpret this as a*baseline position preference*inherited from training, whose magnitude depends on the base model\.

\(ii\) CoT reasoning replaces baseline bias with length\-accumulated bias\.For Llama, the direct→\\rightarrowCoT transition substantially*reduces*aggregatePBS\(MMLU:0\.40→0\.240\.40\\rightarrow 0\.24; ARC:0\.26→0\.090\.26\\rightarrow 0\.09; GPQA:0\.61→0\.410\.61\\rightarrow 0\.41\), and the further CoT→\\rightarrowR1 transition only slightly changes it \(0\.24→0\.220\.24\\rightarrow 0\.22;0\.09→0\.090\.09\\rightarrow 0\.09;0\.41→0\.430\.41\\rightarrow 0\.43\)\. For Qwen, the direct→\\rightarrowCoT transition barely changesPBS\(MMLU:0\.18→0\.180\.18\\rightarrow 0\.18\), because Qwen’s baseline was already low; the CoT→\\rightarrowR1 transition then adds length\-driven bias \(0\.18→0\.210\.18\\rightarrow 0\.21\)\. Under this framing, the absence of a reasoning\-vs\-Instruct\-CoT gap in the Llama pair \(§[5\.5](https://arxiv.org/html/2605.06672#S5.SS5)\) is a prediction rather than a counterexample: Llama\-Instruct\-CoT already produces substantial reasoning \(1448 MMLU chars on average vs\. 1119 for Qwen\-Instruct\-CoT\), so its length\-driven bias is already partially “paid in\.”

### 5\.5R1 vs Instruct\-CoT aggregate comparison

A natural secondary question is whether reasoning\-tuned models exhibit higher aggregatePBSthan Instruct\-CoT on matched base\-model families\. For the Qwen pair, a one\-sided paired Wilcoxon test rejects equalPBSatp<2×10−5p<2\\times 10^\{\-5\}for all three benchmarks \(R​1R1−\-CoT gap: MMLU\+0\.034\+0\.034, ARC\+0\.043\+0\.043, GPQA\+0\.107\+0\.107\)\. For the Llama pair, all three tests are*not*significant \(MMLUp=0\.99p=0\.99, ARCp=0\.55p=0\.55, GPQAp=0\.18p=0\.18;R​1R1−\-CoT gap: MMLU−0\.020\-0\.020, ARC−0\.002\-0\.002, GPQA\+0\.021\+0\.021\)\.

Under the length\-driven account \(§[5\.1](https://arxiv.org/html/2605.06672#S5.SS1)–§[5\.2](https://arxiv.org/html/2605.06672#S5.SS2)\), the magnitude of the reasoning\-vs\-CoTPBSgap should track the reasoning\-vs\-CoT length gap within a family\. This is what we observe: Qwen’s R1/CoT length ratio is∼\\sim5×\\timeson MMLU, vs\.∼\\sim3×\\timesfor Llama, with correspondingly larger and smallerPBSgaps respectively\. Thus the Llama pair’s null result is consistent with the length mechanism, not a rejection of it\.

## 6Discussion

### 6\.1Why does reasoning length drive position bias?

At each decoding step, the model attends over the prompt \(which contains positional information about the answer options\) and over its own reasoning so far \(which may already reference specific positions\)\. Longer reasoning accumulates more attention weight on positional features; under soft priors, this aggregate shifts the model’s posterior over the final\-answer token toward position\-preferred options\. This is a property of the reasoning*process*, not of reasoning\-tuned weights: we observe the same gradient signature in Instruct\-CoT, though attenuated by shorter trajectories\.

This perspective also reframesCCP\. In reasoning\-tuned models,CCPmarks a sharp commitment boundary because the post\-CCPtrajectory contains comparatively little further option\-referential content; the model has*decided*, and remaining tokens are mostly explanatory\. In Instruct\-CoT, no such sharp boundary exists; exposure continues through formatting, so commitment is fragile under re\-sampling \(as confirmed by the reversed before\-vs\-after\-CCPchange rate\)\.

### 6\.2The two\-source framework

We find that position bias in MCQ is not a single phenomenon but at least two: \(i\) a base\-model\-specific*baseline*bias, visible in direct\-answer mode, that correlates with base\-model training rather than with trajectory length; \(ii\) a reasoning\-process*length\-accumulated*bias, universal across reasoning\-capable configurations and scaling linearly with log length\. CoT reasoning replaces \(i\) with \(ii\)\. For models with severe baseline bias, this is a net reduction; for models with mild baseline bias, it produces a flat or slight increase, and subsequent length extension amplifies \(ii\)\.

A complete account of baseline bias is beyond the scope of this paper; we conjecture it is driven by preference\-tuning data distribution and by tokenizer\-induced sampling asymmetries among lettered options\. Disentangling these is important future work\.

### 6\.3Implications for reasoning\-model evaluation

Our results argue that reasoning\-capable models should*not*be treated as order\-robust by default in MCQ\-style evaluation, judging, or grading pipelines\. Concretely:

- •Permutation averaging is not optionalfor reasoning models used as judges; the relevant bias mechanism \(length\-accumulated\) is amplified, not mitigated, by longer CoT\.
- •Length\-controlled ablationsare needed when comparing reasoning and non\-reasoning baselines; aggregate differences can be driven entirely by length rather than by reasoning quality\.
- •CCP, effective switching, and truncation probesprovide cheap, offline\-computable diagnostics that flag models or question types likely to be particularly bias\-prone before deployment\.

## 7Conclusion

Within any reasoning\-capable language model, per\-question position bias in multiple\-choice QA scales with the length of the reasoning trajectory\. This holds for reasoning\-tuned models, for base models prompted with CoT, and for DeepSeek\-R1 at 671B parameters\. A truncation intervention provides causal evidence for an*accumulated\-exposure*mechanism; a cross\-scale anchor suggests that accuracy gates the expression of the mechanism rather than eliminating it\. Direct\-answer mode hosts a distinct, base\-model\-specific baseline bias that is replaced by the length\-driven mechanism when CoT is engaged\. “More thinking” is not, on its own, a debiasing intervention; evaluation pipelines that use reasoning\-capable models as judges or graders should account for this\.

## Limitations

#### Scale and coverage\.

Our local experiments use 7–8B open\-weight models with a single 671B API anchor; we do not evaluate commercial closed models\.

#### Task format\.

We study 4\-option MCQ with cyclic permutations\. Generalization to open\-ended evaluation, pairwise judging, and ranking tasks is untested\.

#### Language and benchmarks\.

English only\. Non\-English MCQ benchmarks may produce different profiles\.

#### Quantization\.

Local experiments use Q4\_K\_M; we do not run full\-precision controls on the 7–8B pairs\. The 671B API result suggests the effect is not quantization\-specific\.

#### Baseline\-bias mechanism\.

We observe two distinct bias regimes but do not characterize the mechanism of baseline bias\. A full account of the direct\-mode phenomenon is important follow\-up work\.

#### Diagnostic, not mitigation\.

The paper is diagnostic\. We do not propose or evaluate a mitigation algorithm\.

## Ethics Statement

This work uses publicly available MCQ benchmarks \(MMLU, ARC, GPQA\) with no personally identifying content\. All evaluated models are open\-weight or accessed via paid API under the provider’s terms of service\. Our findings concern evaluation reliability of reasoning\-tuned LLMs; disclosure improves rather than harms downstream users\.

## Reproducibility Statement

All non\-API experiments are reproducible on a single A100\-80G with the provided scripts, configs, and HuggingFace model checkpoints\. Random seeds are fixed for data subsetting; generation is greedy for main experiments and uses fixed seeds for truncation continuations\. Full hyperparameters, prompts, and extraction regexes are in Appendix[A](https://arxiv.org/html/2605.06672#A1)\.

## References

- Clark et al\. \(2018\)Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord\.Think you have solved question answering? Try ARC, the AI2 reasoning challenge\.*arXiv preprint arXiv:1803\.05457*, 2018\.URL[https://arxiv\.org/abs/1803\.05457](https://arxiv.org/abs/1803.05457)\.
- DeepSeek\-AI et al\. \(2025\)DeepSeek\-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, et al\.DeepSeek\-R1: Incentivizing reasoning capability in LLMs via reinforcement learning\.*Nature*, 645:633–638, 2025\.doi:10\.1038/s41586\-025\-09422\-z\.URL[https://arxiv\.org/abs/2501\.12948](https://arxiv.org/abs/2501.12948)\.arXiv:2501\.12948\.
- Hendrycks et al\. \(2021\)Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt\.Measuring massive multitask language understanding\.In*International Conference on Learning Representations \(ICLR\)*, 2021\.URL[https://openreview\.net/forum?id=d7KBjmI3GmQ](https://openreview.net/forum?id=d7KBjmI3GmQ)\.
- Luo et al\. \(2025\)Guoqing Luo, Iffat Maab, Lili Mou, and Junichi Yamagishi\.Investigating thinking behaviours of reasoning\-based language models for social bias mitigation\.*arXiv preprint arXiv:2510\.17062*, 2025\.URL[https://arxiv\.org/abs/2510\.17062](https://arxiv.org/abs/2510.17062)\.
- Pezeshkpour and Hruschka \(2024\)Pouya Pezeshkpour and Estevam Hruschka\.Large language models sensitivity to the order of options in multiple\-choice questions\.In*Findings of the Association for Computational Linguistics: NAACL 2024*, pages 2006–2017, Mexico City, Mexico, 2024\. Association for Computational Linguistics\.doi:10\.18653/v1/2024\.findings\-naacl\.130\.URL[https://aclanthology\.org/2024\.findings\-naacl\.130/](https://aclanthology.org/2024.findings-naacl.130/)\.
- Qwen Team \(2024\)Qwen Team\.QwQ: Reflect deeply on the boundaries of the unknown\.[https://qwenlm\.github\.io/blog/qwq\-32b\-preview/](https://qwenlm.github.io/blog/qwq-32b-preview/), 2024\.Blog post\.
- Rein et al\. \(2024\)David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R\. Bowman\.GPQA: A graduate\-level Google\-Proof Q&A benchmark\.In*Conference on Language Modeling \(COLM\)*, 2024\.URL[https://arxiv\.org/abs/2311\.12022](https://arxiv.org/abs/2311.12022)\.
- Wang et al\. \(2024\)Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui\.Large language models are not fair evaluators\.In*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 9440–9450, Bangkok, Thailand, 2024\. Association for Computational Linguistics\.doi:10\.18653/v1/2024\.acl\-long\.511\.URL[https://aclanthology\.org/2024\.acl\-long\.511/](https://aclanthology.org/2024.acl-long.511/)\.
- Wu et al\. \(2025\)Xuyang Wu, Jinming Nian, Ting\-Ruen Wei, Zhiqiang Tao, Hsin\-Tai Wu, and Yi Fang\.Does reasoning introduce bias? A study of social bias evaluation and mitigation in LLM reasoning\.In*Findings of the Association for Computational Linguistics: EMNLP 2025*, pages 18534–18555, Suzhou, China, 2025\. Association for Computational Linguistics\.doi:10\.18653/v1/2025\.findings\-emnlp\.1006\.URL[https://aclanthology\.org/2025\.findings\-emnlp\.1006/](https://aclanthology.org/2025.findings-emnlp.1006/)\.
- Zheng et al\. \(2023\)Lianmin Zheng, Wei\-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P\. Xing, Hao Zhang, Joseph E\. Gonzalez, and Ion Stoica\.Judging LLM\-as\-a\-judge with MT\-Bench and chatbot arena\.In*Advances in Neural Information Processing Systems \(NeurIPS\), Datasets and Benchmarks Track*, 2023\.URL[https://arxiv\.org/abs/2306\.05685](https://arxiv.org/abs/2306.05685)\.

## Appendix APrompt Templates and Answer Extraction

This appendix documents the exact prompts and extraction logic used in our experiments\. The source code lives inprompts\.py\(prompt templates\),utils\.py\(extraction, trajectory, andCCPcomputation\), andrun\_main\.py\(local models\) /run\_r1\_api\.py\(671B API driver\)\.

### A\.1Prompt templates

We use three modes, corresponding to direct answering, Instruct\-CoT, and R1\-style reasoning\. Each is implemented as a\(system, user\)pair of chat messages\.

#### Direct mode\.

> SYSTEM:You are a helpful assistant\. Answer the multiple\-choice question with ONLY the letter of the correct answer in this exact format: The answer is \(X\)\. Do NOT explain your reasoning\. Do NOT show any work\. USER:\{question\} \{options\_text\}

#### CoT mode \(Instruct \+ forced reasoning\)\.

> SYSTEM:You are a helpful assistant that solves multiple\-choice questions\. Think step by step before giving your final answer\. Keep your reasoning concise\. After your reasoning, clearly state your final answer as: The answer is \(X\)\. USER:\{question\} \{options\_text\} Think step by step, then give your final answer\.

#### Reasoning mode \(R1\-distilled models and DeepSeek\-R1 671B API\)\.

> SYSTEM:You are a helpful assistant that solves multiple\-choice questions\. Think step by step before giving your final answer\. After your reasoning, clearly state your final answer as: The answer is \(X\)\. USER:\{question\} \{options\_text\} Think step by step, then give your final answer\.

The reasoning and CoT modes use nearly identical user prompts; they differ only in the system prompt’s instruction about verbosity \(CoT asks for concise reasoning, reasoning mode does not\)\. R1\-family models auto\-activate their internal<think\>…</think\>block in response to this prompt; for the 671B API, the same content is returned in thereasoning\_contentfield, which we concatenate withcontentto form the full trajectory\.

The\{options\_text\}string is formatted as"A\. \{option\_A\}\\n B\. \{option\_B\}\\n …"by theformat\_optionshelper\. No few\-shot examples are used\.

### A\.2Final\-answer extraction

We extract the final answer letter from the generated text via an ordered cascade of four regular expressions\. The first pattern that yields a match in\{A,B,C,D\}wins; records with no match are excluded from bias metrics\.

1. 1\.`\[Tt\]he\\s\+answer\\s\+is\\s\*\[\(:\]?\\s\*\(\[A\-D\]\)\\b`\(“The answer is X” / “The answer is \(X\)”\)
2. 2\.`\\\*?\\\*?\[Aa\]nswer\\\*?\\\*?\\s\*\[:\\x\{FF1A\}\]\\s\*\[\(\]?\\s\*\(\[A\-D\]\)\\b`\(“Answer: X” / “\*\*Answer:\*\* X”\)
3. 3\.`\\\\boxed\\\{\(\[A\-D\]\)\\\}`\(LaTeX box, common in R1 traces\)
4. 4\.`\\b\(\[A\-D\]\)\\s\*\\\.?\\s\*$`\(single terminal letter on its own line\)

Patterns 1 and 2 additionally accept Chinese\-style full\-width colon \(U\+FF1A\) to handle occasional code\-switched outputs from the distilled R1\-Qwen checkpoints\. Extraction rates per configuration are reported in Table[2](https://arxiv.org/html/2605.06672#A2.T2); all exceed 95%\.

### A\.3Trajectory extraction forCCPand switching

The definition ofCCPin Eq\. \([2](https://arxiv.org/html/2605.06672#S3.E2)\) is computed from a*trajectory*of answer mentions within the generated text\. A trajectory is the sequence of positions at which the text*commits to*a specific letter in\{A,B,C,D\}\\\{A,B,C,D\\\}\. We extract trajectory entries by scanning the text for any of the following nine patterns \(case\-insensitive\), yielding triples \(letter, char\-position, normalized\-position\) sorted in order of appearance:

1. 1\.`\[Oo\]ption\\s\+\(\[A\-D\]\)\\b`\(“Option X”\)
2. 2\.`\(?:\[Aa\]nswer\|\[Cc\]hoice\|\[Ss\]elect\)\\s\*\(?:is\|\[:\\x\{FF1A\}\]\)?\\s\*\[\(\]?\(\[A\-D\]\)\\b`\(answer/choice/select \+ X\)
3. 3\.`\\\(\(\[A\-D\]\)\\\)`\(parenthesized letter\)
4. 4\.`\\b\(\[A\-D\]\)\\\.\\s`\(letter followed by period and space\)
5. 5\.`\\b\(\[A\-D\]\)\\s\+\(?:is\|seems?\|looks?\|appears?\)\\s\+\(?:correct\|right\|better\|more\)`\(“X is correct”\)
6. 6\.`\(?:choose\|pick\|go\\s\+with\|lean\\w\*\\s\+\(?:towards?\|to\)\)\\s\+\[\(\]?\(\[A\-D\]\)\\b`\(“choose X”, “pick X”, “lean toward X”\)
7. 7\.`\[Ss\]o\\s\+\(?:it’?s\|the\\s\+answer\\s\+is\)?\\s\*\[\(\]?\(\[A\-D\]\)\\b`\(“so it’s X”\)
8. 8\.`\(?:I\\s\+\(?:think\|believe\)\)\\s\+\.\*\\b\(\[A\-D\]\)\\b`\(“I think X”\)
9. 9\.`\\\\boxed\\\{\(\[A\-D\]\)\\\}`\(LaTeX box\)

Duplicate matches at the same character position are removed\. Given the trajectoryτ=\[\(ℓ1,p1\),…,\(ℓn,pn\)\]\\tau=\[\(\\ell\_\{1\},p\_\{1\}\),\\ldots,\(\\ell\_\{n\},p\_\{n\}\)\]and the extracted final answera​\(T\)a\(T\), we implement Eq\. \([2](https://arxiv.org/html/2605.06672#S3.E2)\) as

CCP=\{0if​ℓi=a​\(T\)​∀i,pi⋆\+1otherwise, where​i⋆=max⁡\{i:ℓi≠a​\(T\)\},\\textsc\{CCP\}=\\begin\{cases\}0&\\text\{if \}\\ell\_\{i\}=a\(T\)\\ \\forall i,\\\\ p\_\{i^\{\\star\}\+1\}&\\text\{otherwise, where \}i^\{\\star\}=\\max\\\{i:\\ell\_\{i\}\\neq a\(T\)\\\},\\end\{cases\}i\.e\., the normalized position of the first mention after the last non\-final\-answer mention\. Effective switching \(Eff\-Sw\) is computed on the same trajectory as the number of transitionsℓi→ℓi\+1\\ell\_\{i\}\\to\\ell\_\{i\+1\}withℓi≠ℓi\+1\\ell\_\{i\}\\neq\\ell\_\{i\+1\}, normalized by trajectory length\.

### A\.4Separation of thinking from response

For R1\-family models that output a thinking block, we use a single regex to separate the thinking trace from the final\-response prose:

\(DOTALL mode\)\. The first group is the thinking trace; the second group is the post\-thinking response\. If no<think\>tags are present, the thinking trace is empty and the full text is treated as the response\. For the 671B API, the API separately returnsreasoning\_content\(thinking\) andcontent\(response\), so no regex is needed\.

Our analyses ofCCPand switching operate on the*full*trajectory \(thinking \+ response concatenated\) for all reasoning\-mode configurations\.

## Appendix BPer\-Configuration Summary and Extraction Rates

Table[2](https://arxiv.org/html/2605.06672#A2.T2)reports the full per\-configuration summary referenced in §[5](https://arxiv.org/html/2605.06672#S5)\.

Table 2:Full per\-configuration results\. Acc = accuracy;PBS= Position Bias Score;CCP= Commitment Change Point;Eff\-Sw= effective switching \(switch\-count normalized by trajectory length\); Len = mean trajectory length in characters\.†The 671B API returns switch counts on a different scale; see §[3\.3](https://arxiv.org/html/2605.06672#S3.SS3)for definition\.#### Extraction rates\.

We define extraction rate as the fraction of generated records for which the cascade in Appendix[A](https://arxiv.org/html/2605.06672#A1)yields a letter in\{A,B,C,D\}\\\{A,B,C,D\\\}\. Across all 19 \(model, mode, benchmark\) configurations, extraction rate exceeds 95%; for all reasoning\-mode configurations on MMLU and ARC it exceeds 99%\. The lowest observed rate is on GPQA with R1\-Qwen\-7B \(95\.5%\) and R1\-Llama\-8B \(97\.0%\), attributable to trajectory truncation on the 8192\-token context on a small fraction of highly verbose items\.

## Appendix CLength\-Bucket Robustness

Figure[1](https://arxiv.org/html/2605.06672#S5.F1)in the main text uses four length quartiles\. To rule out bucketing artifacts, we repeat the analysis withk∈\{3,5,10\}k\\in\\\{3,5,10\\\}equal\-frequency bins, for the four local R1\-distilled configurations\. Table[3](https://arxiv.org/html/2605.06672#A3.T3)summarizes the result\.

Table 3:Length\-bucket robustness\. ✓ indicates strictly monotonicPBSincrease from the shortest to the longest bin;viol@iiindicates the first bin at which an out\-of\-order adjacent pair occurs\. End\-point gradient \(longest\-binPBSminus shortest\-binPBS\) is strictly positive in all 18 cells at allkk, mean gradient 0\.290\.Strict monotonicity holds in 10 of 18 cells\. Among the 8 non\-monotonic cells, all violations occur in single adjacent\-bin pairs in the middle of the distribution, and the end\-point gradient \(longest bin−\-shortest bin\) remains positive in 18 of 18 cells at everykk, with mean gradient0\.2900\.290across configurations\. In fact, largerkkproduces*larger*end\-point gradients for most configurations \(e\.g\., R1\-Qwen\-7B on GPQA:0\.2960\.296atk=3k=3vs\.0\.4990\.499atk=10k=10\), indicating that the underlying length\-PBSrelationship is stronger than thek=4k=4quartile view suggests; the local non\-monotonicity at highkkis consistent with increased within\-bin sampling noise \(each bin atk=10k=10contains only∼\\sim25 questions for the smallest benchmarks\)\. We conclude that the monotonic trend reported in the main text is robust to bucketing choice\.

## Appendix DR1 vs Instruct\-CoT Wilcoxon Detail

Table[4](https://arxiv.org/html/2605.06672#A4.T4)reports the full one\-sided paired Wilcoxon signed\-rank test comparing R1\-reasoningPBSto InstructPBS\(bothdirectandCoT\) at the per\-question level, paired byquestion\_idx\.

Table 4:Full paired Wilcoxon signed\-rank test results, one\-sided alternative \(R​1R1\-reasoningPBS*greater than*reference modePBS\)\.pp\-values of1\.01\.0under the one\-sided alternative indicate*the opposite direction is significant*: the†rows show that R1\-Llama\-8B has significantly*lower*PBSthan Llama\-Instruct\-direct across all three benchmarks, corroborating the two\-source framework \(§[5\.4](https://arxiv.org/html/2605.06672#S5.SS4)\)\.n\.s\.: not significant under either direction\.The three†rows are notable: they provide independent statistical confirmation that*CoT reasoning substantially reduces position bias in Llama\-Instruct, relative to direct answering*, consistent with the baseline\-bias account in §[5\.4](https://arxiv.org/html/2605.06672#S5.SS4)\. The analogous Qwen rows show the opposite pattern because Qwen’s direct\-mode baseline bias is already mild\.

## Appendix EPer\-Question Trajectory Examples

We include three illustrative trajectories from our R1\-Qwen\-7B MMLU run, one per category\. Full trajectories are provided in the accompanying supplementary material; we excerpt beginnings here\.

### Example 1: short trajectory, low PBS

Subject: abstract algebra; question ID: 43; mean trajectory length: 992 chars; per\-questionPBS:0\.0000\.000\(identical answer across all four permutations\); correctly answered on all four permutations\.

> Okay, so I have this problem about group theory\. The question is asking for the identity element in a set of integersℤ\\mathbb\{Z\}with a binary operation defined asa∗b=a\+b\+1a\*b=a\+b\+1\. Hmm, groups…\\ldotsright, they require an identity element where when you operate any element with it, you get the same element back\. Let me recall what an identity elementeeshould satisfy: for anya∈ℤa\\in\\mathbb\{Z\},a∗e=aa\*e=aande∗a=ae\*a=a\.…\\ldots

When the model can solve the question quickly, the trajectory does not engage with the answer\-option letters at length, and the answer is therefore stable across permutations\.

### Example 2: long trajectory, high PBS

Subject: econometrics; question ID: 745; mean trajectory length:10,81610\{,\}816chars \(11×11\\timeslonger than Example 1\); per\-questionPBS:0\.6120\.612\(near the theoretical maximum of0\.750\.75\); incorrectly answered on all four permutations; mean of9\.29\.2option\-referential phrases per trajectory\.

> Okay, so I’ve got this multiple\-choice question about confidence intervals for the intercept term in a regression model\. Hmm, let me think through how to approach this\. First, I remember that a confidence interval gives a range of values within which we believe the true population parameter lies, with a certain level of confidence — in this case, 95%\. The formula for a confidence interval is usually something like:CI=estimate±\(critical value\)⋅\(standard error\)​…\\mathrm\{CI\}=\\mathrm\{estimate\}\\pm\(\\text\{critical value\}\)\\cdot\(\\text\{standard error\}\)\\ldots

On this question, the model cycles through multiple option\-letter references \(“AAwould imply…\\ldots”, “let me re\-checkCC”, etc\.\) and ends in a different position depending on permutation\. This is the canonical accumulated\-exposure signature predicted by the length\-driven account\.

### Example 3: truncation continuation that shifts to position A

Question ID: 27 \(MMLU\); truncation offset:\+0\.15\+0\.15relative toCCP\(absolute trajectory position0\.7560\.756\); prefix length:2,1152\{,\}115tokens; continuation length:3,5773\{,\}577tokens \(re\-sampled withT=0\.7T=0\.7\); original trajectory answer: C; resumed continuation’s final answer:A\.

The intervention log records metric\-level information only; raw continuation text was not retained to keep the log compact\. The flip is nonetheless an informative instance of the mechanism: the prefix has already “committed” to C \(the extracted answer from the original complete trajectory\), yet re\-sampling from a point22\.5%22\.5\\%into the post\-CCPregion produces a3,5773\{,\}577\-token continuation that converges on A — the position\-preferred option in our dataset\. Aggregate statistics from this class of continuations drive the monotonic directional\-shift gradient reported in Figure[3](https://arxiv.org/html/2605.06672#S5.F3)\(b\)\.

Similar Articles

Quantized Reasoning Models Think They Need to Think Longer, but They Do Not

arXiv cs.LG

This paper reveals that aggressive post-training quantization of reasoning models leads to increased overthinking errors, where models reach correct intermediate answers but fail to finalize them. A simple logit penalty on overthinking markers reduces chain-of-thought length by 12-23% while improving accuracy, especially for quantized models.

Reasoning Models Don't Just Think Longer, They Move Differently

arXiv cs.CL

This paper investigates whether reasoning-trained language models simply allocate more compute (longer chains of thought) or follow qualitatively different internal trajectories by analyzing hidden-state trajectory geometry across code, math, and SAT domains. After correcting for generation length, they find that reasoning-trained models exhibit distinct trajectory geometry—most clearly in code—indicating reasoning training changes how computation unfolds, not just how much is used.

More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models

Papers with Code Trending

This paper uncovers that prolonged reasoning in vision-language models can impair perceptual grounding, causing recognition failures on basic visual questions. It proposes Vision-Anchored Policy Optimization (VAPO) to steer reasoning toward visually grounded trajectories, achieving state-of-the-art performance with the VAPO-Thinker-7B model.

Beyond Accuracy: Measuring Bias Acknowledgment in Chain-of-Thought Reasoning for Responsible AI Evaluation

arXiv cs.LG

This paper introduces a trace-level diagnostic for evaluating chain-of-thought reasoning, separating susceptibility (whether bias changes the answer) from acknowledgment (whether the trace flags the biased input). Experiments show models like GPT-4o and Claude Sonnet 4 have similar susceptibility rates but very different acknowledgment rates, highlighting a blind spot in accuracy-only evaluation.