Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations

arXiv cs.CL Papers

Summary

This paper presents a comprehensive empirical evaluation of how large language models handle corruptions in chain-of-thought reasoning steps, testing 13 models across 5 perturbation types (MathError, UnitConversion, Sycophancy, SkippedSteps, ExtraSteps) on mathematical reasoning tasks. The findings reveal heterogeneous vulnerability patterns with implications for deploying LLMs in multi-stage reasoning pipelines.

arXiv:2603.03332v3 Announce Type: replace Abstract: Chain-of-Thought (CoT) prompting has emerged as a foundational technique for eliciting reasoning from Large Language Models (LLMs), yet the robustness of this approach to corruptions in intermediate reasoning steps remains poorly understood. This paper presents a comprehensive empirical evaluation of LLM robustness to a structured taxonomy of 5 CoT perturbation types: MathError, UnitConversion, Sycophancy, SkippedSteps, and ExtraSteps. We evaluate 13 models spanning three orders of magnitude in parameter count, testing their ability to complete mathematical reasoning tasks despite perturbations injected in the reasoning chain. Our key findings reveal heterogeneous vulnerability patterns: MathError perturbations produce the most severe degradation in small models (50-60% accuracy loss) but show strong scaling benefits; UnitConversion remains challenging across all scales (>5% loss even for midsized models); ExtraSteps incur minimal accuracy degradation (0-6%) even for the smallest of models; Sycophancy and SkippedSteps produce modest effects (~10% loss for small models) and slightly improve with scale. Scaling relationships show that model size serves as a protective factor against many perturbations but not always. These findings have direct implications for deploying LLMs in multi-stage reasoning pipelines and underscore the necessity of task-specific robustness assessments and mitigation strategies. The code and results are available at https://github.com/Mystic-Slice/CoTPerturbation


# Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations
Source: https://arxiv.org/html/2603.03332
Ashwath Vaithinathan Aravindan
University of Southern California, Los Angeles, 90007, California, United States of America
Mayank Kejriwal
Information Sciences Institute, 4676 Admiralty Way #1001, Los Angeles, 90292, California, United States of America

###### Abstract

Chain-of-Thought (CoT) prompting has emerged as a foundational technique for eliciting reasoning from Large Language Models (LLMs), yet the robustness of this approach to corruptions in intermediate reasoning steps remains poorly understood. This paper presents a comprehensive empirical evaluation of LLM robustness to a structured taxonomy of 5 CoT perturbation types: MathError, UnitConversion, Sycophancy, SkippedSteps, and ExtraSteps. We evaluate 13 models spanning three orders of magnitude in parameter count, testing their ability to complete mathematical reasoning tasks despite perturbations injected in the reasoning chain. Our key findings reveal heterogeneous vulnerability patterns: MathError perturbations produce the most severe degradation in small models (50-60% accuracy loss) but show strong scaling benefits; UnitConversion remains challenging across all scales (>5% loss even for mid-sized models); ExtraSteps incur minimal accuracy degradation (0-6%) even for the smallest models; Sycophancy and SkippedSteps produce modest effects (~10% loss for small models) and slightly improve with scale. Scaling relationships show that model size serves as a protective factor against many perturbations but not always. These findings have direct implications for deploying LLMs in multi-stage reasoning pipelines and underscore the necessity of task-specific robustness assessments and mitigation strategies. The code and results are available here (https://github.com/Mystic-Slice/CoTPerturbation).

###### Keywords

Large Language Model, Robustness, Chain-of-Thought, LLM Reasoning

## Introduction

Large Language Models have emerged as transformative tools across diverse domains, from natural language understanding to scientific discovery. A defining strength of these models is their capacity to perform complex reasoning tasks that require multiple steps of logic or calculation. As LLMs are increasingly deployed in applications where accuracy and reliability are critical, understanding their reasoning capabilities and limitations has become essential. The success of these models on complex tasks hinges not merely on pattern recognition, but on their ability to transparently reason through problems in ways that humans can understand and verify.

Chain-of-Thought (CoT) prompting has become an important technique for eliciting complex reasoning from Large Language Models (LLMs). Providing intermediate reasoning steps has been shown to dramatically improve performance on mathematical problem-solving and multi-step reasoning tasks. Building on this, even zero-shot CoT prompting, using simple phrases such as "Let's think step by step", can unlock latent reasoning capabilities in LLMs without annotated examples. This success has established CoT as a de facto standard in prompting strategies for reasoning-intensive applications.

Yet this success raises a fundamental question: to what extent are LLMs genuinely performing step-by-step logical reasoning, and to what extent are they exploiting surface-level patterns learned during training? When a model produces a correct final answer following a CoT trajectory, does it verify the consistency of intermediate steps, or does it simply correlate reasoning text with expected outputs? This distinction carries immediate practical implications for high-stakes applications such as finance, medicine, and scientific discovery, where understanding whether models achieve accuracy through robust reasoning or through brittle pattern matching is essential for safe deployment.

Recent empirical work has exposed concerning fragility in CoT reasoning. Single-character typographical errors have been shown to significantly degrade accuracy on mathematical benchmarks. Semantically adversarial perturbations to code-reasoning problems reduce accuracy by over 42%. A "snowball" effect has been identified where errors early in reasoning chains amplify through subsequent steps. These findings collectively highlight the brittleness of LLM reasoning to input corruptions. However, existing studies focus narrowly on specific perturbation types (typos, code-level attacks) or isolated models, leaving unanswered the question of how diverse, reasoning-specific corruptions affect multiple model families across different scales and architectures.

In real-world deployment, reasoning chains may be incomplete, contain computational errors, or originate from upstream systems of varying quality. Assessing how LLMs handle such realistic corruptions is essential for building trustworthy multi-stage reasoning pipelines. Yet no prior work has systematically evaluated a broad range of models against a comprehensive, structured taxonomy of reasoning-specific perturbations.

To fill this gap, we present a systematic evaluation of LLM robustness to CoT perturbations. Our contributions are:

1. A structured perturbation taxonomy comprising 5 reasoning-specific types: mathematical errors, unit conversion, sycophancy, skipped steps, and extra steps;
2. A broad empirical evaluation across 13 language models spanning three orders of magnitude in parameter count, revealing how robustness scales with model size and varies across perturbation types;
3. Quantitative characterization of differential scaling relationships, showing that robustness improvements are heterogeneous: steep for mathematical errors, shallow for sycophancy and skipped steps, and absent for redundant information.

The rest of this paper is structured as follows. We first survey related work on CoT reasoning and robustness evaluation. Next, we introduce our perturbation taxonomy and experimental methodology. We then present our empirical findings across multiple models and perturbation types. Finally, we discuss the implications and limitations of our work.

## Preliminaries

We begin by establishing formal definitions of chain-of-thought reasoning and perturbations to facilitate precise analysis throughout this work.

**Chain-of-Thought Prompting.** Let ℳ denote a language model parametrized by θ. Given a problem instance x and a prompt template Π, the model generates a sequence of tokens, which we decompose into intermediate reasoning steps and a final answer. Formally, we can represent a complete CoT response as:

ℳ(x, Π) = ⟨r₁, r₂, ..., rₙ, a⟩     (1)

where rᵢ denotes the i-th intermediate reasoning step and a denotes the final answer. Each rᵢ is a sequence of tokens produced by the model. The prompt template Π includes the instruction "Let's think step by step" or variants thereof, which conditions the model to produce this step-by-step decomposition.
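The decomposition in Equation (1) can be illustrated with a minimal sketch. It assumes, purely for illustration, that each reasoning step occupies its own line and that the final answer is marked with an `Answer:` prefix; the `CoTResponse` type, the `parse_cot` helper, and that output format are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CoTResponse:
    """A CoT response <r_1, ..., r_n, a>: reasoning steps plus a final answer."""
    steps: Tuple[str, ...]
    answer: str


def parse_cot(text: str, answer_prefix: str = "Answer:") -> CoTResponse:
    """Split raw model output into steps and a final answer.

    Assumption: one step per non-empty line, and the answer line starts
    with `answer_prefix`. Real model output may need a more tolerant parser.
    """
    steps: List[str] = []
    answer = ""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines between steps
        if line.startswith(answer_prefix):
            answer = line[len(answer_prefix):].strip()
        else:
            steps.append(line)
    return CoTResponse(tuple(steps), answer)


response = parse_cot("3 + 4 = 7\n7 * 2 = 14\nAnswer: 14")
print(response.steps, response.answer)
```

Keeping the steps and the answer in separate fields makes it straightforward to modify individual steps while scoring only the final answer.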

**Corrupted Reasoning Chains.** We define a perturbation function 𝒫 that modifies the reasoning chain before the model processes it. Given a ground-truth reasoning chain R = ⟨r₁, r₂, ..., rₙ⟩ and a perturbation type τ ∈ {MathError, UnitConversion, Sycophancy, SkippedSteps, ExtraSteps}, the corrupted chain is:

R' = 𝒫_τ(R) = ⟨r'₁, r'₂, ..., r'ₙ⟩     (2)
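As a rough illustration of a perturbation function 𝒫_τ, the sketch below implements toy versions of three of the five types. The function names, the filler sentence, and the rule "add an offset to the last integer in a step" are assumptions made for this example; the paper's actual perturbation procedure may differ, and UnitConversion and Sycophancy are omitted because they depend on details not given in this section.

```python
import re
from typing import Callable, Dict, List

Chain = List[str]  # a reasoning chain R as an ordered list of step strings


def skipped_steps(chain: Chain, index: int) -> Chain:
    """SkippedSteps (toy version): drop the step at `index`."""
    return chain[:index] + chain[index + 1:]


def extra_steps(chain: Chain, index: int,
                filler: str = "Note: restating the previous step.") -> Chain:
    """ExtraSteps (toy version): insert a redundant step before `index`."""
    return chain[:index] + [filler] + chain[index:]


def math_error(chain: Chain, index: int, offset: int = 1) -> Chain:
    """MathError (toy version): add `offset` to the last integer in a step.

    Only plain integers are handled; a step without digits is left unchanged.
    """
    step = chain[index]
    matches = list(re.finditer(r"\d+", step))
    if not matches:
        return list(chain)
    last = matches[-1]
    corrupted = (step[:last.start()]
                 + str(int(last.group()) + offset)
                 + step[last.end():])
    return chain[:index] + [corrupted] + chain[index + 1:]


# Hypothetical registry mapping a perturbation type tau to its function.
PERTURBATIONS: Dict[str, Callable[..., Chain]] = {
    "MathError": math_error,
    "SkippedSteps": skipped_steps,
    "ExtraSteps": extra_steps,
}

clean = ["3 + 4 = 7", "7 * 2 = 14"]
print(PERTURBATIONS["MathError"](clean, 0))
```

Each function returns a new list rather than mutating its input, so the same clean chain can be reused across perturbation types.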

The model then processes this corrupted chain in context: ℳ(R', x, Π) = ⟨r'₁, ..., r'ₙ, a'⟩, where a' is the answer produced conditioned on the perturbed reasoning. The robustness of model ℳ to perturbation τ is quantified as:

Robustness_τ(ℳ) = Accuracy(a' = a_gold) / Accuracy(a = a_gold)     (3)

where a_gold is the correct answer and Accuracy is measured over a test set. A robustness score near 1 indicates the model maintains correctness despite perturbations, while a score near 0 indicates severe degradation.
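Equation (3) can be computed directly from three aligned lists of answers. The sketch below assumes exact string matching after whitespace stripping as the correctness criterion; that criterion and the helper names `accuracy` and `robustness` are illustrative choices, not the paper's evaluation code.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold answers."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("the test set must not be empty")
    correct = sum(p.strip() == g.strip() for p, g in zip(predictions, gold))
    return correct / len(gold)


def robustness(perturbed_preds: Sequence[str],
               clean_preds: Sequence[str],
               gold: Sequence[str]) -> float:
    """Equation (3): perturbed accuracy divided by clean accuracy."""
    clean_acc = accuracy(clean_preds, gold)
    if clean_acc == 0:
        raise ValueError("clean accuracy is zero; the ratio is undefined")
    return accuracy(perturbed_preds, gold) / clean_acc


# Made-up answers for four problems, for illustration only.
gold = ["10", "7", "3", "5"]
clean = ["10", "7", "3", "4"]       # 3 of 4 correct
perturbed = ["10", "0", "3", "4"]   # 2 of 4 correct
print(round(robustness(perturbed, clean, gold), 3))  # prints 0.667
```

Because the score is a ratio, it is undefined when clean accuracy is zero, and it can exceed 1 if a perturbation happens to help on a particular test set.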

**Scaling and Heterogeneous Perturbation Effects.** Our analysis examines how robustness evolves across models of varying scales. Let ℳ_s denote a model family with parameter counts s ∈ {7B, 13B, 70B, 405B, ...}. We investigate whether robustness to perturbation τ exhibits:

Robustness_τ(ℳ_{s₁}) ≶ Robustness_τ(ℳ_{s₂})  for s₁ < s₂     (4)

This relationship may vary across perturbation types, giving rise to heterogeneous scaling patterns that are central to our empirical investigation.
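One simple way to summarize the comparison in Equation (4) across several model sizes is the least-squares slope of robustness against log10 of the parameter count. This is an illustrative choice rather than the analysis reported in the paper, the helper name `scaling_slope` is hypothetical, and the numbers below are made up.

```python
import math
from typing import Dict


def scaling_slope(robustness_by_size: Dict[float, float]) -> float:
    """Least-squares slope of robustness versus log10(parameter count).

    A positive slope means robustness tends to improve with scale, a slope
    near zero means little change, and a negative slope means it degrades.
    """
    if len(robustness_by_size) < 2:
        raise ValueError("need at least two model sizes")
    xs = [math.log10(size) for size in robustness_by_size]
    ys = list(robustness_by_size.values())
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Made-up robustness scores keyed by parameter count, for illustration only.
steep = {1e9: 0.5, 1e10: 0.7, 1e11: 0.9}
flat = {1e9: 0.95, 1e10: 0.95, 1e11: 0.95}
print(scaling_slope(steep), scaling_slope(flat))
```

Computing one slope per perturbation type gives a compact way to compare heterogeneous scaling patterns, although a single slope hides any non-monotonic behavior between individual sizes.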

## Related Work

**Input robustness and adversarial perturbations.** The question of LLM robustness to corrupted or adversarially perturbed inputs has become increasingly important as these systems are deployed in real-world applications. Singh et al. systematically examined LLM robustness to real-world text perturbations, including spelling errors, OCR noise, and synonym substitution, demonstrating that many generative LLMs show surprising robustness to these common noise types. However, Alahmari et al. revealed a critical counterpoint: models trained exclusively on clean data produce unpredictable outputs under even minor perturbations like single-character typos, suggesting that training set composition significantly determines perturbation tolerance. Bogavelli et al. scaled this analysis to enterprise contexts, finding that prompt formatting variations, word-order changes, and language variations can degrade performance by up to 40 percentage points, with the surprising finding that smaller models sometimes maintain consistency better than larger ones across these transformations. Beyond surface-level input noise, PromptBench provides a comprehensive evaluation framework that characterizes LLM fragility to adversarially crafted instructions themselves, covering semantic attacks, structural modifications, and character-level perturbations. DeceptPrompt demonstrates concrete exploitation strategies, showing how adversarial natural language instructions can systematically mislead code generation models into producing incorrect or insecure code. Our work shifts focus from surface-level input noise and instruction-level attacks to a complementary regime: perturbations introduced at intermediate steps of reasoning chains, which represents a failure mode distinct from input-level fragility.

**Chain-of-thought reasoning and its variants.** Wei et al. demonstrated that prompting models to articulate reasoning steps dramatically improves performance on complex reasoning tasks, while Kojima et al. showed this effect persists even without task-specific examples. Building on this foundation, researchers have proposed variants to enhance CoT reasoning: Plan-and-Solve improves zero-shot performance by encouraging explicit planning before solving, and Program of Thoughts separates logical reasoning from computation, allowing models to delegate numerical operations. Yet a troubling gap has emerged between apparent reasoning and actual understanding. Turpin et al. identified "Clever Hans" behavior, where models exploit surface-level correlations rather than engaging in genuine reasoning. This concern is validated by ProcessBench and DeltaBench, which reveal that correct final answers frequently coexist with severe internal reasoning errors, suggesting that standard accuracy metrics mask fundamental fragility in intermediate reasoning steps. On the adversarial front, Gan et al. quantified how even single-character typos severely degrade reasoning accuracy (e.g., reducing Mistral-7B's GSM8K performance from 43.7% to 38.6%), while Xiang et al. introduced BadChain, a targeted backdoor attack that injects subtle semantic violations into reasoning chains, demonstrating how CoT's step-by-step structure creates new attack surfaces. Beyond typos, Roh et al. and Yue et al. showed that adversarial perturbations to code-reasoning tasks reduce accuracy by over 42%, revealing that CoT structure itself can be systematically exploited across multiple domains. At a deeper level, Mirzadeh et al. developed GSM-Symbolic to expose a fundamental limitation: models that solve math word problems correctly fail when symbolic or semantic transformations are applied (e.g., renaming variables or rearranging logical structure), indicating reliance on spurious correlations rather than compositional understanding. Zhu et al. identified the "snowball" effect, where errors introduced early in a reasoning chain amplify through subsequent steps due to error cascading, and proposed AdvChain, an adversarial fine-tuning approach that improves robustness by training on corrupted reasoning chains.

**Error detection, verification, and reasoning biases.** To address CoT brittleness, the community has explored complementary mitigation strategies. Zhang et al. and Guo et al. demonstrated that training-based approaches can enable models to verify and correct their own reasoning through supervised learning and reinforcement learning, effectively teaching models to detect and repair errors. Zhang et al. further showed that errors near the end of reasoning chains are disproportionately harmful to final correctness, motivating targeted intervention at high-risk positions. Beyond training, evaluation metrics have evolved to capture reasoning quality more precisely. ROSCOE
