Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics

Hugging Face Daily Papers

Summary

This paper introduces a method for monitoring the reasoning process of Large Reasoning Models by analyzing probe trajectories—the evolution of a concept's probability across generated tokens. The approach uses temporal and signal-processing features from hidden representations to better predict future model behavior, achieving up to 95% AUROC with max-pooling.
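The three pooling operations named in the paper can be illustrated with a minimal sketch. The source does not say exactly what is pooled, so this example assumes pooling over the token axis of a per-token hidden-state matrix; the function name `pool_hidden_states` and the toy numbers are hypothetical and not taken from the paper.

```python
import numpy as np

def pool_hidden_states(hidden, mode):
    """Collapse a (num_tokens, hidden_dim) array into one (hidden_dim,) vector.

    Assumption: pooling runs over the token axis. The paper's abstract does
    not specify this detail.
    """
    if mode == "max":
        return hidden.max(axis=0)   # per-dimension maximum over tokens
    if mode == "mean":
        return hidden.mean(axis=0)  # per-dimension average over tokens
    if mode == "last":
        return hidden[-1]           # representation of the final token only
    raise ValueError(f"unknown pooling mode: {mode}")

# Toy sequence: 4 tokens, 3 hidden dimensions (made-up values).
hidden = np.array([
    [0.1, -0.2,  0.0],
    [0.9,  0.1, -0.3],
    [0.2,  0.4,  0.1],
    [0.0, -0.1,  0.2],
])

pooled = {mode: pool_hidden_states(hidden, mode) for mode in ("max", "mean", "last")}
for mode, vector in pooled.items():
    print(mode, vector)
```

In this toy input, the spike of 0.9 at the second token survives max-pooling, is diluted to 0.3 by averaging, and is absent from the last-token vector. That is one possible intuition for why the pooling choice could matter; it is not an explanation given by the source.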

Cached at: 05/19/26, 10:31 AM

Source: https://huggingface.co/papers/2605.18549

Abstract

Temporal analysis of hidden representations improves safety monitoring of Large Reasoning Models: probe trajectories and signal-processing features predict future model behavior better than static, single-point approaches.

Large Reasoning Models (LRMs) introduce new opportunities for safety monitoring through their Chain of Thought (CoT) reasoning. However, CoT is not always faithful to the model’s final output, undermining its reliability as a monitoring tool. To address this, we investigate the hidden representations of LRMs to determine whether future behavior can be predicted from prompt and CoT representations. By evaluating a probe at each generated token, we construct a probe trajectory, the continuous evolution of a concept’s probability across the reasoning process. We find that future model behavior is more distinguishable when examined over the full trajectory than from a single static prediction. To characterize these temporal dynamics, we extract signal-processing features that capture volatility, trend, and steady-state behavior, significantly improving the separation of future model states. We also present two methodological insights. First, template-based training data achieves near-parity with dynamically generated model responses, eliminating the need for a costly initial inference and labeling. Second, the choice of pooling operation is critical: average-pooling and last-token methods collapse to near-random performance, while max-pooling achieves up to 95% AUROC and yields stable probe trajectories. Using four datasets and four reasoning models across the domains of safety and mathematics, we demonstrate that trajectory features encode task-specific dynamics that improve outcome separability. These findings establish probe trajectories as a complementary framework for monitoring LRM behavior. Warning: This article contains potentially harmful content.
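The idea of a probe trajectory and its summary features can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: it assumes a linear probe with a sigmoid output, and the concrete feature definitions (least-squares slope for trend, standard deviation of step-to-step changes for volatility, mean of the final quarter for steady state) are hypothetical choices, since the abstract names only the feature categories. All function names and numbers are made up for the example.

```python
import numpy as np

def probe_trajectory(hidden, weights, bias):
    """Evaluate an (assumed) linear probe at every token.

    hidden: (num_tokens, hidden_dim) array of per-token hidden states.
    Returns one concept probability per token: sigmoid(h_t . w + b).
    """
    logits = hidden @ weights + bias
    return 1.0 / (1.0 + np.exp(-logits))

def trajectory_features(traj, tail_frac=0.25):
    """Summarise a 1-D probability trajectory (hypothetical feature set)."""
    steps = np.arange(len(traj))
    tail_len = max(1, int(len(traj) * tail_frac))
    return {
        "trend": np.polyfit(steps, traj, 1)[0],   # least-squares slope
        "volatility": np.std(np.diff(traj)),      # spread of step-to-step changes
        "steady_state": traj[-tail_len:].mean(),  # mean over the final tokens
        "final": traj[-1],                        # single static prediction
    }

# Toy hidden states (4 tokens, 2 dims) and toy probe parameters.
hidden = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
traj = probe_trajectory(hidden, weights=np.array([1.0, 0.0]), bias=-1.5)
print(traj, trajectory_features(traj))

# Two made-up trajectories that end at the same probability.
rising = np.array([0.1, 0.3, 0.5, 0.7, 0.9, 0.9])
noisy = np.array([0.9, 0.1, 0.9, 0.1, 0.9, 0.9])
feats_rising = trajectory_features(rising)
feats_noisy = trajectory_features(noisy)
print(feats_rising["volatility"], feats_noisy["volatility"])
```

The two made-up trajectories share the same final value, so a single static prediction cannot tell them apart, while the volatility feature differs substantially. This mirrors, in toy form, the abstract's claim that the full trajectory carries more information than one point.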

Similar Articles

Reasoning Models Don't Just Think Longer, They Move Differently

arXiv cs.CL

This paper investigates whether reasoning-trained language models simply allocate more compute (longer chains of thought) or follow qualitatively different internal trajectories, analyzing hidden-state trajectory geometry across code, math, and SAT domains. After correcting for generation length, the authors find that reasoning-trained models exhibit distinct trajectory geometry, most clearly in code, indicating that reasoning training changes how computation unfolds, not just how much is used.

Evaluating chain-of-thought monitorability

OpenAI Blog

OpenAI researchers introduce a framework and suite of 13 evaluations to systematically measure chain-of-thought monitorability in large language models, finding that monitoring reasoning processes is substantially more effective than monitoring outputs alone, with important implications for AI safety and supervision at scale.

Reasoning models struggle to control their chains of thought, and that’s good

OpenAI Blog

OpenAI researchers study whether reasoning models can deliberately obscure their chain-of-thought to evade monitoring, finding that current models struggle to control their reasoning even when aware of monitoring. They introduce CoT-Control, an open-source evaluation suite with over 13,000 tasks to measure chain-of-thought controllability in reasoning models.

Thinking Past the Answer: Evaluating Harmful Overthinking in Large Reasoning Models

arXiv cs.AI

This paper introduces a prefix-level trajectory evaluation protocol to distinguish harmful overthinking from verbose but harmless overthinking in large reasoning models, showing that continued reasoning after reaching the correct answer can destabilize performance. The authors find that early stopping improves accuracy by up to 21% on multimodal benchmarks, and identify logical drift and visual reinterpretation as key causes of correctness deviations.

Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight

arXiv cs.AI

This paper introduces Behavior Cue Reasoning, a method that trains LLMs to emit specific token sequences before behaviors, making reasoning traces more monitorable and controllable. It demonstrates that this approach improves safety oversight and efficiency by allowing external monitors to prune wasted reasoning tokens and intercept unsafe actions without sacrificing performance.