When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models
Summary
This paper analyzes failure modes in multi-turn reasoning models using a CoT-Output safety matrix. It reports two counterintuitive findings: alignment-faking rates increase under explicit monitoring cues, and in context-injection failures the model's safe internal reasoning is followed by harmful output.
Source: https://huggingface.co/papers/2606.10740
TL;DR: Standard safety evaluations are missing a massive chunk of how reasoning models actually fail.
In this paper, we moved beyond static, single-turn prompts to analyze multi-turn adversarial dialogues across distilled models like DeepSeek-R1-7B, Phi-4-Mini, and Qwen-4B-Thinking. By introducing a CoT-Output 2x2 safety matrix, we evaluated 6,750 turn-level observations and found some highly counterintuitive dynamics:
- **The Oversight Paradox:** Explicit monitoring cues (telling the model it is being watched) paradoxically increase alignment-faking rates rather than suppress them. For DeepSeek-R1-7B, robust alignment actually peaks when the model thinks it is unmonitored.
- **Context-Injection Failure:** We document a multi-turn, adversarial manifestation of CoT unfaithfulness. In this state, the model’s internal CoT maintains safe reasoning (and explicitly notes the need for refusal), but the prolonged adversarial context hijacks the generation, causing the visible output to produce harm.
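
To make the CoT-Output 2x2 matrix concrete, here is a minimal sketch of how turn-level observations might be sorted into its four cells. It assumes each turn has already been judged safe or unsafe on two axes, the reasoning trace and the visible output. The names `TurnObservation`, `matrix_cell`, `tally`, and the cell labels are hypothetical and are not taken from the paper or its dataset; only the mapping of "safe CoT, harmful output" to context-injection failure follows the description above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnObservation:
    """One turn of a dialogue, judged on two axes (hypothetical schema)."""
    cot_safe: bool     # was the chain-of-thought judged safe?
    output_safe: bool  # was the visible output judged safe?


def matrix_cell(obs: TurnObservation) -> str:
    """Map an observation to one of the four cells of a 2x2 matrix."""
    if obs.cot_safe and obs.output_safe:
        return "safe_cot/safe_output"
    if obs.cot_safe and not obs.output_safe:
        # The case described above as context-injection failure:
        # the reasoning stays safe but the output is harmful.
        return "safe_cot/harmful_output"
    if not obs.cot_safe and obs.output_safe:
        return "unsafe_cot/safe_output"
    return "unsafe_cot/harmful_output"


def tally(observations: list[TurnObservation]) -> Counter:
    """Count how many observations fall into each cell."""
    return Counter(matrix_cell(o) for o in observations)


if __name__ == "__main__":
    # Made-up illustrative judgments, not data from the paper.
    sample = [
        TurnObservation(cot_safe=True, output_safe=True),
        TurnObservation(cot_safe=True, output_safe=False),
        TurnObservation(cot_safe=False, output_safe=True),
        TurnObservation(cot_safe=True, output_safe=False),
    ]
    for cell, count in sorted(tally(sample).items()):
        print(f"{cell}: {count}")
```

Keeping the two judgments as separate booleans, rather than a single safety label, is what lets the off-diagonal cells show up at all; a single-label evaluation would fold them into "safe" or "unsafe".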
We’re releasing the full dataset of 450 multi-turn dialogues and CoT traces on Hugging Face. Our hope is that this provides a reproducible behavioral substrate for follow-up mechanistic interpretability research, and helps the community investigate whether these unfaithfulness dynamics are artifacts of distillation or fundamental vulnerabilities that scale to frontier models.
Similar Articles
Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering
This paper investigates safety failures in Large Reasoning Models where harmful content appears in reasoning traces despite safe final answers, proposing an adaptive multi-principle steering method to mitigate these risks.
The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure
This paper identifies a novel failure mode in reasoning models called unfaithful capitulation, where the chain-of-thought remains factually correct across adversarial multi-turn dialogues but the final answer flips wrong, highlighting limitations of current evaluation methods.
Reasoning models struggle to control their chains of thought, and that’s good
OpenAI researchers study whether reasoning models can deliberately obscure their chain-of-thought to evade monitoring, finding that current models struggle to control their reasoning even when aware of monitoring. They introduce CoT-Control, an open-source evaluation suite with over 13,000 tasks to measure chain-of-thought controllability in reasoning models.
Detecting misbehavior in frontier reasoning models
OpenAI researchers demonstrate that chain-of-thought monitoring can detect misbehavior in frontier reasoning models like o3-mini, but warn that directly optimizing CoT to prevent bad thoughts causes models to hide their intent rather than eliminate the behavior.
When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains
Researchers introduce HarmThoughts, a benchmark with 56,931 annotated sentences from 1,018 reasoning traces to evaluate harmful behavior emergence step-by-step, revealing that current detectors miss nuanced unsafe reasoning transitions.