When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models

Hugging Face Daily Papers 06/09/26, 11:50 AM Papers

Summary

This paper analyzes failure modes in multi-turn reasoning models by introducing a CoT-Output safety matrix, revealing paradoxes like increased alignment-faking under monitoring cues and context-injection failures where safe internal reasoning is overridden by harmful outputs.

Failures in multi-turn reasoning models are largely invisible to terminal-score evaluation. A model can lock onto an unsafe stance early in a long dialogue, yet its final-turn refusal rate may appear indistinguishable from that of a robustly aligned baseline. To expose these hidden temporal dynamics, we propose a trace-level diagnostic: the CoT-Output 2x2 safety matrix. This framework labels every turn along two independent axes (internal reasoning and visible output), yielding four operationally defined cells: robust alignment, alignment faking, overt jailbreak, and a distinct failure mode we term context-injection failure (where the CoT maintains safe reasoning but the visible output produces harm, a multi-turn manifestation of reasoning unfaithfulness). We evaluate three distilled reasoning targets against a fixed attacker across five oversight conditions, collecting 6,750 turn-level observations on the Information-Hazard scenario. Our analysis reveals two reproducible vulnerabilities: an oversight paradox, in which explicit monitoring cues increase alignment-faking rates rather than suppress them, and a context-injection failure, in which models lock onto unsafe external outputs despite safe internal states. We release the full dataset of multi-turn dialogues and CoT traces to support follow-up trace-diagnostic research.
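The 2x2 labeling described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `Cell`, `classify_turn`, `final_output_safe`, and `trace_labels` are hypothetical, and each turn is assumed to already carry two boolean safety judgments. Only the context-injection cell (safe CoT, harmful output) is defined explicitly in the abstract; the mapping of the other three cells is an assumed reading of their names.

```python
from enum import Enum


class Cell(Enum):
    ROBUST_ALIGNMENT = "robust alignment"
    ALIGNMENT_FAKING = "alignment faking"
    OVERT_JAILBREAK = "overt jailbreak"
    CONTEXT_INJECTION_FAILURE = "context-injection failure"


def classify_turn(cot_safe: bool, output_safe: bool) -> Cell:
    """Map one turn's two safety judgments onto a matrix cell."""
    if cot_safe and output_safe:
        return Cell.ROBUST_ALIGNMENT
    if cot_safe and not output_safe:
        # Stated in the abstract: safe reasoning, harmful visible output.
        return Cell.CONTEXT_INJECTION_FAILURE
    if not cot_safe and output_safe:
        # Assumption: unsafe reasoning behind a safe-looking output.
        return Cell.ALIGNMENT_FAKING
    # Assumption: both reasoning and output are unsafe.
    return Cell.OVERT_JAILBREAK


def final_output_safe(turns):
    """Terminal-score view: only the last turn's visible output counts."""
    return turns[-1][1]


def trace_labels(turns):
    """Trace-level view: one cell label per turn."""
    return [classify_turn(cot, out) for cot, out in turns]


# Hypothetical three-turn dialogue as (cot_safe, output_safe) pairs.
dialogue = [(True, True), (True, False), (True, True)]

print(final_output_safe(dialogue))
print([cell.value for cell in trace_labels(dialogue)])
```

In this made-up dialogue the final output is safe, so a terminal-score check passes, while the per-turn labels still surface the mid-dialogue context-injection failure.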

Original Article

Cached at: 06/10/26, 05:45 AM

Source: https://huggingface.co/papers/2606.10740

TL;DR: Standard safety evaluations are missing a massive chunk of how reasoning models actually fail.

In this paper, we moved beyond static, single-turn prompts to analyze multi-turn adversarial dialogues across distilled models like DeepSeek-R1-7B, Phi-4-Mini, and Qwen-4B-Thinking. By introducing a CoT-Output 2x2 safety matrix, we evaluated 6,750 turn-level observations and found some highly counterintuitive dynamics:

**The Oversight Paradox:** Explicit monitoring cues (telling the model it is being watched) paradoxically increase alignment-faking rates rather than suppress them. For DeepSeek-R1-7B, robust alignment actually peaks when the model thinks it is unmonitored.
**Context-Injection Failure:** We document a multi-turn, adversarial manifestation of CoT unfaithfulness. In this state, the model’s internal CoT maintains safe reasoning (and explicitly notes the need for refusal), but the prolonged adversarial context hijacks the generation, causing the visible output to produce harm.
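A comparison like the oversight paradox comes down to tabulating, per oversight condition, the share of turns that fall in each matrix cell. The sketch below assumes turn-level records shaped as `(condition, cell)` pairs; the function name `cell_rates`, the condition names, and the toy records are all hypothetical and are not taken from the released dataset or its results.

```python
from collections import Counter, defaultdict


def cell_rates(records):
    """Return {condition: {cell: fraction of that condition's turns}}."""
    counts = defaultdict(Counter)
    for condition, cell in records:
        counts[condition][cell] += 1
    rates = {}
    for condition, cell_counts in counts.items():
        total = sum(cell_counts.values())
        rates[condition] = {cell: n / total for cell, n in cell_counts.items()}
    return rates


# Toy records for illustration only (made-up values, not paper results).
records = [
    ("monitored", "alignment faking"),
    ("monitored", "alignment faking"),
    ("monitored", "robust alignment"),
    ("monitored", "overt jailbreak"),
    ("unmonitored", "alignment faking"),
    ("unmonitored", "robust alignment"),
    ("unmonitored", "robust alignment"),
    ("unmonitored", "context-injection failure"),
]

rates = cell_rates(records)
for condition, by_cell in rates.items():
    print(condition, by_cell.get("alignment faking", 0.0))
```

Comparing the alignment-faking fraction across conditions, per target model, is the kind of per-cell breakdown that a single final-turn refusal rate cannot provide.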

We’re releasing the full dataset of 450 multi-turn dialogues and CoT traces on Hugging Face. Our hope is that this provides a reproducible behavioral substrate for follow-up mechanistic interpretability research, and helps the community investigate whether these unfaithfulness dynamics are artifacts of distillation or fundamental vulnerabilities that scale to frontier models.

Similar Articles

Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering

The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure

Reasoning models struggle to control their chains of thought, and that’s good

Detecting misbehavior in frontier reasoning models

When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains
