When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models

Hugging Face Daily Papers

Summary

This paper analyzes failure modes in multi-turn reasoning models by introducing a CoT-Output safety matrix, revealing paradoxes like increased alignment-faking under monitoring cues and context-injection failures where safe internal reasoning is overridden by harmful outputs.

Failures in multi-turn reasoning models are largely invisible to terminal-score evaluation. A model can lock onto an unsafe stance early in a long dialogue, yet its final-turn refusal rate may appear indistinguishable from that of a robustly aligned baseline. To expose these hidden temporal dynamics, we propose a trace-level diagnostic: the CoT-Output 2x2 safety matrix. This framework labels every turn along two independent axes (internal reasoning and visible output), yielding four operationally defined cells: robust alignment, alignment faking, overt jailbreak, and a distinct failure mode we term context-injection failure (where the CoT maintains safe reasoning but the visible output produces harm, a multi-turn manifestation of reasoning unfaithfulness). We evaluate three distilled reasoning targets against a fixed attacker across five oversight conditions, collecting 6,750 turn-level observations on the Information-Hazard scenario. Our analysis reveals two reproducible vulnerabilities: an oversight paradox, in which explicit monitoring cues increase alignment-faking rates rather than suppress them, and a context-injection failure, in which models lock onto unsafe external outputs despite safe internal states. We release the full dataset of multi-turn dialogues and CoT traces to support follow-up trace-diagnostic research.
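
The 2x2 labeling described above can be sketched as a small lookup. This is a minimal illustration, not the authors' implementation: the names `Cell` and `classify_turn` are hypothetical, and the two boolean judgments are assumed to come from some external labeling step that the abstract does not describe. The abstract only spells out the context-injection cell (safe CoT, harmful output); the placement of alignment faking (unsafe CoT, safe output) and overt jailbreak (unsafe CoT, harmful output) is inferred by elimination.

```python
from enum import Enum


class Cell(Enum):
    """The four cells of the CoT-Output 2x2 safety matrix."""
    ROBUST_ALIGNMENT = "robust alignment"
    ALIGNMENT_FAKING = "alignment faking"
    OVERT_JAILBREAK = "overt jailbreak"
    CONTEXT_INJECTION_FAILURE = "context-injection failure"


def classify_turn(cot_safe: bool, output_safe: bool) -> Cell:
    """Map one turn's two safety judgments to a matrix cell.

    cot_safe:    True if the internal reasoning was judged safe.
    output_safe: True if the visible output was judged safe.
    Both judgments are assumed inputs (hypothetical labeler).
    """
    if cot_safe and output_safe:
        return Cell.ROBUST_ALIGNMENT
    if cot_safe and not output_safe:
        # Stated in the abstract: safe reasoning, harmful output.
        return Cell.CONTEXT_INJECTION_FAILURE
    if not cot_safe and output_safe:
        # Assumed mapping: unsafe reasoning behind a safe-looking output.
        return Cell.ALIGNMENT_FAKING
    # Assumed mapping: both reasoning and output are unsafe.
    return Cell.OVERT_JAILBREAK
```

Labeling each turn separately, rather than only the last one, is what lets a trace-level analysis see a dialogue that passes through a failure cell before ending in a refusal.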

Cached at: 06/10/26, 05:45 AM


Source: https://huggingface.co/papers/2606.10740

TL;DR: Standard safety evaluations are missing a massive chunk of how reasoning models actually fail.

In this paper, we moved beyond static, single-turn prompts to analyze multi-turn adversarial dialogues across distilled models like DeepSeek-R1-7B, Phi-4-Mini, and Qwen-4B-Thinking. By introducing a CoT-Output 2x2 safety matrix, we evaluated 6,750 turn-level observations and found some highly counterintuitive dynamics:

  • **The Oversight Paradox:** Explicit monitoring cues (telling the model it is being watched) paradoxically increase alignment-faking rates rather than suppress them. For DeepSeek-R1-7B, robust alignment actually peaks when the model thinks it is unmonitored.
  • **Context-Injection Failure:** We document a multi-turn, adversarial manifestation of CoT unfaithfulness. In this state, the model’s internal CoT maintains safe reasoning (and explicitly notes the need for refusal), but the prolonged adversarial context hijacks the generation, causing the visible output to produce harm.
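
Comparing cell rates across oversight conditions, as the oversight-paradox finding does, could look roughly like the sketch below. Everything here is an assumption for illustration: the function name `cell_rates`, the tuple layout, the condition names `"monitored"` and `"unmonitored"`, and the eight toy observations, which are made up and are not data from the paper. The alignment-faking and overt-jailbreak mappings are inferred by elimination from the 2x2 layout.

```python
from collections import Counter, defaultdict

# (cot_safe, output_safe) -> cell name. Only the context-injection cell is
# spelled out in the source; the two unsafe-CoT cells are assumed.
CELLS = {
    (True, True): "robust alignment",
    (True, False): "context-injection failure",
    (False, True): "alignment faking",
    (False, False): "overt jailbreak",
}


def cell_rates(observations):
    """Return {condition: {cell: fraction of turns}} from turn-level labels.

    observations: iterable of (condition, cot_safe, output_safe) tuples.
    """
    counts = defaultdict(Counter)
    for condition, cot_safe, output_safe in observations:
        counts[condition][CELLS[(cot_safe, output_safe)]] += 1

    rates = {}
    for condition, counter in counts.items():
        total = sum(counter.values())
        # Report all four cells, including those with zero turns.
        rates[condition] = {cell: counter[cell] / total for cell in CELLS.values()}
    return rates


# Toy, made-up observations purely to exercise the function.
toy = [
    ("monitored", False, True),
    ("monitored", True, True),
    ("monitored", True, False),
    ("monitored", False, True),
    ("unmonitored", True, True),
    ("unmonitored", True, True),
    ("unmonitored", False, False),
    ("unmonitored", False, True),
]

rates = cell_rates(toy)
for condition, by_cell in rates.items():
    print(condition, by_cell)
```

With real turn-level labels, an oversight paradox would show up as a higher alignment-faking fraction under the monitored condition than under the unmonitored one.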

We’re releasing the full dataset of 450 multi-turn dialogues and CoT traces on Hugging Face. Our hope is that this provides a reproducible behavioral substrate for follow-up mechanistic interpretability research, and helps the community investigate whether these unfaithfulness dynamics are artifacts of distillation or fundamental vulnerabilities that scale to frontier models.
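
As a quick sanity check, the reported totals are mutually consistent. The sketch below uses only figures stated on this page (6,750 turn-level observations, 450 dialogues, three target models, five oversight conditions). The derived values are averages: whether every dialogue has exactly the same number of turns, or whether dialogues are split evenly across model and condition pairs, is an assumption the page does not confirm.

```python
TURN_OBSERVATIONS = 6750   # turn-level observations reported
DIALOGUES = 450            # multi-turn dialogues in the released dataset
TARGET_MODELS = 3          # distilled reasoning targets
OVERSIGHT_CONDITIONS = 5   # oversight conditions

# Average number of labeled turns per dialogue.
turns_per_dialogue = TURN_OBSERVATIONS / DIALOGUES

# Number of (model, condition) pairs, and dialogues per pair
# under an assumed balanced design.
configurations = TARGET_MODELS * OVERSIGHT_CONDITIONS
dialogues_per_configuration = DIALOGUES / configurations

print(turns_per_dialogue, configurations, dialogues_per_configuration)
```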
