
# Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight
Source: [https://arxiv.org/html/2605.07021](https://arxiv.org/html/2605.07021)
Christopher Z. Cui¹, Taylor W. Killian², Prithviraj Ammanabrolu¹
¹University of California, San Diego  ²Brigham Young University
czcui@ucsd.edu

###### Abstract

Reasoning in Large Language Models (LLMs) poses a challenge for oversight, as many misaligned behaviors do not surface until reasoning concludes. To address this, we introduce Behavior Cue Reasoning for making LLM reasoning more controllable and monitorable. Behavior Cues are special token sequences that a model is trained to emit immediately before specific implicit and explicit behaviors, acting as dual-purpose signals and control levers. Our experiments reveal that a Behavior Cue Reasoning model performs as well as or better than the base model, allows for steerable reasoning through external enforcement of Behavior Cues, and improves the monitorability of reasoning for external oversight monitors. When fine-tuning a weaker external monitor with Reinforcement Learning for reasoning oversight, a compressed view of only the information surfaced by Behavior Cues is sufficient signal for the monitor to prune up to 50% of otherwise wasted reasoning tokens in complex math problem solving. When leveraged by an almost optimal rule-based monitor in an environment where excessive constraint violations result in failure, Behavior Cues allow for the recovery of safe actions from 80% of reasoning traces that would otherwise end with the proposal of an unsafe action, more than doubling the success rate from 46% to 96%. Through evaluation across two model families and three domains, we show that Behavior Cue Reasoning improves reasoning monitorability and controllability with no cost to performance. More broadly, our work progresses scalable oversight by demonstrating how the monitored model itself can be trained to reason more tractably under oversight. [Code](https://github.com/christopherzc/text-games)

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2605.07021v1/figures/fig1_v5.png)

Figure 1: We take reasoning traces from a base actor model (1) and elicit the immediate working answer at each step by truncating the reasoning, appending a stop phrase, and allowing the model to produce an answer (2). Working answers are filtered to emulate the behavior of the working answer *only* surfacing when it changes, then embedded back into the reasoning trace with Behavior Cues (3). Post SFT (4), the Behavior Cue Reasoning model adopts this answer-reporting behavior, creating ideal decision points for an external oversight monitor. In *Efficiency Monitoring*, the monitor optimizes stopping once the correct final answer surfaces. In *Safety Monitoring*, the monitor attempts to prevent the commitment of unsafe actions by continuing reasoning. In *Safety Enforcement*, the monitor records the latest safe action to commit instead, allowing for secured unsafe reasoning.

For difficult tasks, reasoning LLMs are trained to search within their token space by generating reasoning traces to refine their final answer. These traces allow the model to trade inference-time compute for better performance [[27](https://arxiv.org/html/2605.07021#bib.bib27), [41](https://arxiv.org/html/2605.07021#bib.bib41), [2](https://arxiv.org/html/2605.07021#bib.bib2)], but are typically treated as artifacts excluded from the message history, with the exception of tool use [[9](https://arxiv.org/html/2605.07021#bib.bib9)]. Most critically, misaligned behaviors can remain hidden in the reasoning trace, surfacing only when the model commits to an answer. Without deeper visibility into these reasoning traces, an external monitor tasked with oversight has limited ability to detect and interrupt these behaviors before their negative effects manifest in the model's final response.

Prior approaches either inject steering phrases into the reasoning trace at decoding time [[19](https://arxiv.org/html/2605.07021#bib.bib19), [42](https://arxiv.org/html/2605.07021#bib.bib42), [49](https://arxiv.org/html/2605.07021#bib.bib49), [40](https://arxiv.org/html/2605.07021#bib.bib40), [25](https://arxiv.org/html/2605.07021#bib.bib25)], or train the model to condition on special tokens inserted externally into its context [[11](https://arxiv.org/html/2605.07021#bib.bib11), [21](https://arxiv.org/html/2605.07021#bib.bib21), [39](https://arxiv.org/html/2605.07021#bib.bib39)]. Some combine both, using special tokens the model is trained to respond to as steering phrases [[30](https://arxiv.org/html/2605.07021#bib.bib30)]. In all cases, the phrase or tokens are treated as external signals. The model itself does not surface information about its own underlying behavior, limiting applicability to oversight.

To address this, we introduce Behavior Cues to progress scalable oversight by training the model to naturally generate special token sequences that signal specific behaviors during reasoning, including behaviors that would otherwise be internal to the model. We define this as Behavior Cue Reasoning. The explicit appearance of Behavior Cues in a model's reasoning allows an external monitor to track behavioral progression through simple parsing, treating the model itself as a black box. The same tokens that surface behavior can also be externally enforced to trigger it. This allows an external oversight monitor to intervene in a model's reasoning, extending monitorability into controllability.

In this work, we investigate three Behavior Cues: [answer], [continue], and [stop]. We evaluate on Qwen3-8B (hybrid reasoning) and GLM-Z1-9B (pure reasoning) to test generalization across different model families in three separate problem domains. AIME serves as a single-turn evaluation for reasoning models in complex math problem solving. Textworld, a situated text environment with natural language observations, evaluates Behavior Cues in an out-of-domain, multi-turn task completion setting. Finally, we extend Hazardworld with a text-based interface to allow for evaluation of task completion under constraint adherence in multi-turn settings.

We evaluate Behavior Cues for scalable oversight through three research questions, one focused on the actor model itself and two on the monitorability of Behavior Cue Reasoning to an external oversight module. For the actor model, we ask: *To what degree does training a model to perform Behavior Cue Reasoning impact baseline task performance and controllability?* For the monitorability of Behavior Cue Reasoning, we evaluate oversight module performance along two axes: *To what degree do Behavior Cues enable an external monitor to reduce wasted reasoning tokens without sacrificing correctness?* and *To what degree do Behavior Cues enable an external monitor to prevent unsafe actions?*

Our results demonstrate overall positive effects on baseline performance, adherence to external enforcement, and the ability of Behavior Cues to enable oversight where standard reasoning would be too noisy or opaque. We observe performance improvements in all domains and both model families. In the majority of cases, the actor models follow injected Behavior Cues with the expected behavior. Most significantly, Behavior Cues serve as a critical interface for external oversight. We observe that Behavior Cues act as a necessary filter for decision points, limiting queries to the oversight module to the moments when intervention has the highest impact. This provides a learnable horizon for RL fine-tuning a non-reasoning monitor. For safety constraint enforcement, we find LLM-based monitors struggle with identifying unsafe actions. However, by parsing candidate actions surfaced at [answer] against rules constructed from training trajectories, a rule-based monitor achieves near-perfect classification accuracy. An LLM augmented with this monitor achieves 50% more success in an environment where excessive constraint violations result in failure. Through these experiments, we show that improved monitorability through Behavior Cue Reasoning enables oversight for both safety and efficiency.

In summary, our contributions are as follows:

- We introduce Behavior Cues, a mechanism for making LLM reasoning more monitorable and controllable for oversight through trained token sequences ([answer], [continue], [stop]) that act as surface-level hooks for reasoning behaviors.
- We show a model trained to perform Behavior Cue Reasoning performs as well as or better than the baseline overall and reliably adheres to the expected behavior when a cue is enforced externally.
- For efficiency monitoring, we demonstrate that Behavior Cue Reasoning enables external oversight via tractable decision horizons, allowing for the training of a non-reasoning monitor and enabling a compressed-trace formulation that correctly prunes up to 50% of wasted reasoning tokens.
- For safety monitoring, we demonstrate that Behavior Cue Reasoning enables a near-perfect rule-based safety monitor to recover safe actions from 80% of otherwise unsafe reasoning traces, raising the success rate from 46% to 96% in a safety-constrained setting.
- We publicly release our LLM-compatible Hazardworld extension as an artifact for reasoning oversight research.

## 2 Background

**Reasoning traces.** Formally, we define a reasoning trace as the content produced by an LLM before it commits to a final answer. For the models investigated in this work, these reasoning traces are enclosed within `<think></think>` tags. We define a trajectory step as the input-output pair corresponding to a single LLM call, and define a reasoning step as a portion of a reasoning trace separated from its neighbors by two newlines. For example, the single-turn question-answer paradigm of AIME is a single trajectory step with potentially multiple reasoning steps. A trajectory is defined as the full sequence of trajectory steps made by the LLM when attempting a task. We define the working answer of an LLM as the answer the LLM would provide if its reasoning were terminated at any step. (In this work, we elicit the working answer by appending `- I'm out of time and need to answer!\n</think>`. We note, however, that in degenerate cases the LLM may not produce an end-of-sequence token after providing an answer.) When there is a ground-truth correct answer, such as in AIME, we specifically refer to reasoning traces where the model never arrives at this correct answer as a *reasoning dead-end*.
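
To make the elicitation procedure concrete, below is a minimal sketch of working-answer elicitation as described above, assuming the paper's step-splitting and stop-phrase conventions. The `generate` callable stands in for any LLM completion interface (e.g. a vLLM or transformers wrapper) and, along with the exact prompt formatting, is an assumption rather than the paper's released code.

```python
from typing import Callable, List

# Stop phrase from the paper's footnote; surrounding formatting is assumed.
STOP_PHRASE = "- I'm out of time and need to answer!\n</think>"

def split_reasoning_steps(trace: str) -> List[str]:
    """Reasoning steps are the portions of a trace separated by two newlines."""
    return [s for s in trace.split("\n\n") if s.strip()]

def elicit_working_answers(prompt: str, trace: str,
                           generate: Callable[[str], str]) -> List[str]:
    """For each step k, truncate the reasoning after step k, append the stop
    phrase, and let the model complete with the answer it would commit to."""
    steps = split_reasoning_steps(trace)
    answers = []
    for k in range(1, len(steps) + 1):
        truncated = "\n\n".join(steps[:k])
        completion = generate(prompt + "<think>\n" + truncated + "\n" + STOP_PHRASE)
        answers.append(completion.strip())
    return answers
```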

**Scalable Oversight.** Scalable oversight refers to the challenge of supervising AI systems whose skills or capabilities exceed those of their evaluators [[1](https://arxiv.org/html/2605.07021#bib.bib1), [5](https://arxiv.org/html/2605.07021#bib.bib5)]. One research direction in this domain is the weak-to-strong paradigm, where weak overseers are studied in the context of supervising a stronger actor [[6](https://arxiv.org/html/2605.07021#bib.bib6), [20](https://arxiv.org/html/2605.07021#bib.bib20)]. Our oversight monitor experiments reside within this framing as an evaluation of the effectiveness of Behavior Cues at making an actor's reasoning more monitorable and controllable to oversight. We discuss other works within the domain of scalable oversight in Section [6](https://arxiv.org/html/2605.07021#S6).

**External Oversight.** When evaluating the impact of Behavior Cues in making an LLM's reasoning more monitorable and controllable to oversight, we refer to the reasoning model as the *actor model*. While prior work has investigated internal model signals such as output distributions [[29](https://arxiv.org/html/2605.07021#bib.bib29), [33](https://arxiv.org/html/2605.07021#bib.bib33), [43](https://arxiv.org/html/2605.07021#bib.bib43)] or hidden state representations [[3](https://arxiv.org/html/2605.07021#bib.bib3), [48](https://arxiv.org/html/2605.07021#bib.bib48)], we treat the actor model as a black box where only the generated tokens are observed. In AIME and Cookingworld, the oversight module is restricted to non-reasoning mode and constrained to only outputting [continue] or [stop]. At each decision point, the oversight module observes the actor model's reasoning and context and decides whether further reasoning is needed. In Hazardworld, we evaluate two oversight approaches under the same no-reasoning-budget constraint: an LLM-based monitor restricted to 'safe' or 'unsafe' outputs, and a rule-based monitor that specifically parses the latest candidate action surfaced by [answer].

## 3 Behavior Cues and External Oversight

**Behavior Cues.** We define Behavior Cues as special token sequences an LLM is trained to naturally emit during reasoning immediately before specific behaviors. In this work, we explore the application of three Behavior Cues:

- [answer]: updates the working answer. We explicitly train the model to provide its best-guess answer without reasoning, resulting in a progression of answers throughout the reasoning phase.
- [continue]: after a new working answer, [continue] signals further reasoning token generation.
- [stop]: after a new working answer, [stop] signals a termination of reasoning.

Thus, when a model trained to produce these Behavior Cues is generating the answer to a question, [answer] should always appear at the start of reasoning, immediately after the `<think>` tag, followed by what is functionally the model's non-reasoning best-guess answer. If the model then generates [continue], it should proceed with its reasoning and produce another [answer] when its working answer changes. If the model generates [stop], it should immediately stop its reasoning. These Behavior Cues are thus meant to serve the dual purpose of signaling the occurrence of specific behaviors and acting as enforceable levers. Figure [1](https://arxiv.org/html/2605.07021#S1.F1) shows how the actor model is trained to naturally produce Behavior Cues as well as how they can be used by an external module as a sample-efficient interface for oversight. Attempts to elicit Behavior Cue Reasoning through prompting or RL were both unsuccessful. See Appendix [A](https://arxiv.org/html/2605.07021#A1) for more details.
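
Because the cues are plain token sequences, an external monitor can track the answer progression with simple string parsing. Below is a minimal sketch under the cue layout described above; the exact surface format of a cue block is an assumption.

```python
import re

# One [answer] block: the cue, the working answer, then the next cue or the
# end of reasoning. The precise surface format is an assumption.
CUE_RE = re.compile(
    r"\[answer\](.*?)(?=\[continue\]|\[stop\]|</think>|\Z)", re.DOTALL
)

def answer_progression(trace: str) -> list[str]:
    """All working answers surfaced so far, in order of appearance."""
    return [m.strip() for m in CUE_RE.findall(trace)]

def latest_working_answer(trace: str) -> str | None:
    answers = answer_progression(trace)
    return answers[-1] if answers else None
```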

Let $r_0, r_1, r_2, \dots, r_{k-1}$ represent the steps in a model's reasoning and $a_0, a_1, a_2, \dots, a_{k-1}$ represent the model's working answers at each corresponding reasoning step. We train a model to perform Behavior Cue Reasoning through an elicitation, embedding, and SFT pipeline. First, we elicit the working answer at each reasoning step for every task in the training set by truncating the reasoning, appending a termination phrase, and allowing the model to produce a final answer (this approach is similar to that of Mao et al. [[24](https://arxiv.org/html/2605.07021#bib.bib24)]). Then, these working answers are embedded into the reasoning between [answer] and [continue] (or [stop]) blocks to create the SFT dataset. At step $r_n$, the answer $a_n$ is only embedded if it is not equal to the last embedded answer. This ensures each occurrence of [answer] is meaningful, marking a change in the model's most probable answer. To isolate any performance drift caused specifically by training a model to perform Behavior Cue Reasoning, we deliberately do not filter reasoning traces for correctness of the final answer.
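
A minimal sketch of the embedding step follows. The interleaving order (each answer block placed after the step that produced it), the exact block formatting, and closing the final surfaced answer with [stop] are all assumptions about the surface format; edge cases such as a trace whose answer does not change at its final step are elided.

```python
from typing import List

def embed_behavior_cues(steps: List[str], answers: List[str]) -> str:
    """Interleave reasoning steps r_n with their working answers a_n,
    emitting an [answer] block only when a_n differs from the last
    embedded answer, so every cue marks an actual answer change."""
    assert len(steps) == len(answers)
    blocks: List[str] = []
    last_embedded = None
    for n, (r_n, a_n) in enumerate(zip(steps, answers)):
        blocks.append(r_n)
        if a_n != last_embedded:
            # [stop] closes the final surfaced answer; [continue] otherwise.
            cue = "[stop]" if n == len(steps) - 1 else "[continue]"
            blocks.append(f"[answer] {a_n} {cue}")
            last_embedded = a_n
    return "<think>\n" + "\n\n".join(blocks) + "\n</think>"
```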

**Oversight Module.** The oversight module is a separate monitor that tracks the actor model's context and reasoning progression and can intervene at specific decision points. Both actor and monitor are treated as mutually opaque: neither has access to the other's internals. In deployment, the monitor's primary means of steering the actor model is token forcing at decoding time. The monitor is trained offline on pre-generated traces from a Behavior Cue Reasoning model to allow for ground-truth verification of impact. We use the Behavior Cue Reasoning actor model to generate reasoning traces instead of assuming the perfect adherence that would arise from using the base model's traces and completions. These reasoning traces are then split 80-20 into training and validation. We use the Qwen3-8B model set to "no-think mode" as a base for fine-tuning.

The occurrence of [answer] can be used as an information-dense signal for when to query the oversight module, filtering these queries to only the moments when the actor model's answer actually changes. Due to the length of reasoning for complex problems, it can be inefficient to query the monitor in a fine-grained manner, such as after every reasoning step. While querying after every reasoning step allows for finer-grained intervention, it introduces substantial noise during training from stretches where the working answer is unchanged.
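
A sketch of this decision-point gating during deployment is shown below. The `generate_to_cue` and `monitor` callables are assumed interfaces (e.g. wrappers around a decoding loop with stop strings and a monitor model call); forcing the chosen cue into the trace illustrates the token-forcing intervention described above.

```python
from typing import Callable, Tuple

def oversee(context: str,
            generate_to_cue: Callable[[str], Tuple[str, bool]],
            monitor: Callable[[str, str], str]) -> str:
    """Run the actor, pausing decoding each time a new [answer] block is
    emitted; the monitor then chooses '[continue]' or '[stop]', which is
    force-decoded into the trace."""
    trace = ""
    while True:
        # generate_to_cue returns (new text, True) if decoding halted right
        # after an [answer] block, or (new text, False) if reasoning ended.
        segment, at_decision_point = generate_to_cue(context + trace)
        trace += segment
        if not at_decision_point:
            return trace
        decision = monitor(context, trace)  # "[continue]" or "[stop]"
        trace += decision                   # forced cue steers the actor
        if decision == "[stop]":
            return trace
```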

We evaluate the effectiveness of [answer] as an information-dense intervention point in two training paradigms: RL for efficiency monitoring and SFT for safety monitoring. As the efficiency outcome depends on the monitor's outputs across all decision points up to the time of termination, we use RL with a terminal reward to train the oversight module for the efficiency objective. For safety monitoring, the safety of any proposed action within the reasoning is independent of previous action classifications, so we use SFT to train the oversight module for the safety objective.

## 4 Experimental Setup

We evaluate Behavior Cues across two model families (Qwen3-8B and GLM-Z1-9B) and three domains (AIME, Textworld, and Hazardworld). We use Qwen3 as a hybrid reasoning model, trained to both reason and immediately answer, and GLM as a pure reasoning model to evaluate the transferability of our method across architectures and training regimes. We use AIME, Textworld, and Hazardworld to evaluate Behavior Cues in making an actor model's reasoning trace more monitorable and controllable to oversight across a diverse set of domains. Our experiments investigate each stage at which Behavior Cues take effect, from the base model itself to their impact through an external oversight monitor. For additional details, see Appendix Section [B](https://arxiv.org/html/2605.07021#A2) for Cookingworld, Section [C](https://arxiv.org/html/2605.07021#A3) for Hazardworld, Section [D](https://arxiv.org/html/2605.07021#A4) for RL fine-tuning, and Section [E](https://arxiv.org/html/2605.07021#A5) for supervised fine-tuning.

**AIME:** The AIME (American Invitational Mathematics Examination) problem set is a series of complex math problems. We use the 2025 AIME question set as an evaluation split to minimize potential data contamination, with the pre-2025 AIME question sets acting as our training set for SFT. For AIME reasoning traces, the monitor has two objectives. For successful reasoning traces, the monitor is tasked with stopping reasoning as soon as the correct answer first surfaces. For dead-end reasoning traces, the monitor is tasked with stopping reasoning as soon as possible.

**Cookingworld:** Textworld [[7](https://arxiv.org/html/2605.07021#bib.bib7)] is a situated text simulator originally designed for training classical RL agents in natural language environments. For our experiments, we use the Cookingworld scenarios used for evaluation in [[8](https://arxiv.org/html/2605.07021#bib.bib8)], with alternate seeds and layouts as our train split. Cookingworld tasks involve the LLM acting as an agent to navigate a household setting, gather ingredients, and prepare a meal. We avoid labeling reasoning traces in Cookingworld as reasoning dead-ends because, due to the multi-step nature of the task, there is almost no ground truth for whether an action is 'incorrect' in the immediate context, as it may lead to success or failure later on. Please see Appendix [B](https://arxiv.org/html/2605.07021#A2) for details.

**Hazardworld:** Hazardworld [[45](https://arxiv.org/html/2605.07021#bib.bib45)] is a Minigrid environment originally designed for training RL agents. In Hazardworld, the agent must navigate to a series of objects while avoiding a set of randomly generated hazard tiles. The specific type of hazard tile the agent must avoid is provided in a natural language description such as *Avoid Water* or *Stay at least 1 block away from Lava*. Unsafe actions in Hazardworld violate the provided constraints. We report the adjusted average reward, where the LLM receives 0 reward if the total constraint violation budget is exceeded. For our experiments, we extend Hazardworld to be compatible with LLM agents. See Appendix [C](https://arxiv.org/html/2605.07021#A3) for a comprehensive list of modifications to the original environment. In Hazardworld, the monitor is tasked solely with the classification of actions as safe or unsafe given the current actor model context.
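
For clarity, the adjusted average reward reduces to the following computation, sketched here under the description above; the raw reward scale and budget value come from the environment configuration.

```python
from typing import Iterable, Tuple

def adjusted_average_reward(episodes: Iterable[Tuple[float, int]],
                            violation_budget: int) -> float:
    """episodes: (raw_reward, violation_count) pairs. An episode scores 0
    if its constraint violations exceed the budget, else its raw reward."""
    episodes = list(episodes)
    total = sum(0.0 if v > violation_budget else r for r, v in episodes)
    return total / len(episodes)
```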

## 5 Evaluation

We center our evaluation around three research questions investigating the actor model and the oversight module. These questions evaluate the actor model on downstream performance and controllability, and evaluate how Behavior Cue Reasoning traces themselves better enable an external oversight module to perform efficiency monitoring and safety monitoring.

**RQ1: To what degree does training a model to perform Behavior Cue Reasoning impact baseline task performance and controllability?** Methods for improving the controllability or monitorability of models typically come at the cost of some negative impact on performance. In these initial experiments, we examine the degree of performance drift from fine-tuning an actor model to perform Behavior Cue Reasoning, as well as how controllable the actor model is to externally enforced Behavior Cues. To evaluate task performance, we fix decoding parameters per domain and report performance on a held-out test set. To evaluate Behavior Cue controllability, we randomly sample 200 decision points for each Behavior Cue Reasoning model where the model would output [continue] or [stop] (100 each), replace the cue with its opposite, and allow the model to continue generating. We measure the rate at which the actor model adheres to the replaced cue. Adherence to [continue] is counted when the model generates at least one [answer] block before terminating reasoning. Adherence to [stop] is counted when the model immediately generates the `</think>` tag and reports the last reported answer.
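
A minimal sketch of this cue-swap probe appears below; `generate` is an assumed completion call, and the adherence checks mirror the criteria just described.

```python
from typing import Callable

def adheres_to_forced_cue(prefix_with_cue: str, forced_cue: str,
                          generate: Callable[[str], str]) -> bool:
    """prefix_with_cue: context plus reasoning truncated at a decision
    point, with the originally emitted cue replaced by `forced_cue`."""
    continuation = generate(prefix_with_cue)
    if forced_cue == "[stop]":
        # Adherent iff the model immediately closes its reasoning.
        return continuation.lstrip().startswith("</think>")
    # "[continue]": adherent iff at least one new [answer] block appears
    # before reasoning terminates.
    return "[answer]" in continuation.split("</think>")[0]
```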

Table 1: Top: task performance. Bottom: adherence to forced reasoning cues (continue / stop) for the Behavior Cues variants. The Hazardworld *adjusted score* is calculated on the basis of a 0 overall reward for trajectories where the constraint violation budget was exceeded. See Table [3](https://arxiv.org/html/2605.07021#S5.T3) for a more detailed breakdown. We see overall performance gains across the board for all models.

| Model | AIME2025 (% Correct) | Cookingworld (% Wins) | Hazardworld (Adj. Score) |
| --- | --- | --- | --- |
| Qwen3-8B | 62.5 | 53.3 | 2.3 |
| Qwen3-8B w/ Behavior Cues | 69.2 | 56.7 | 2.4 |
| GLM-Z1-9B | 54.4 | 36.6 | 1.3 |
| GLM-Z1-9B w/ Behavior Cues | 56.7 | 46.7 | 1.3 |

Adherence to forced reasoning cues (% continue / % stop):

| Model | AIME2025 | Cookingworld | Hazardworld |
| --- | --- | --- | --- |
| Qwen3-8B | 30.9 / 100.0 | 84.0 / 55.0 | 97.0 / 100.0 |
| GLM-Z1-9B | 34.4 / 100.0 | 75.5 / 63.5 | 87.0 / 98.0 |

**Training a model to perform Behavior Cue Reasoning has minimal to positive effects on baseline task performance, while adherence to externally enforced Behavior Cues differs by domain.** Table [1](https://arxiv.org/html/2605.07021#S5.T1) shows the Behavior Cue Reasoning model matches or surpasses base model performance across all domains. The bottom half of Table [1](https://arxiv.org/html/2605.07021#S5.T1) shows adherence to externally enforced [continue] and [stop]. While we see overall adherence, there is a clear stratification by domain. In Hazardworld, reasoning traces can be long (up to 8k tokens), but the model sees many examples of a proposed action followed by [stop] due to the smaller action space. In contrast, reasoning traces from Cookingworld are overall far shorter (up to 2k tokens) and typically involve a similar set of actions. As a result, the model learns to associate the [stop] behavior with a specific subset of actions and reasoning lengths. In particular, we note adherence to [stop] has an inverse relationship to reasoning length, with practically zero adherence when injected at the very first [answer] block, rapidly increasing as reasoning continues and switches between more potential actions. AIME shows the other extreme, with an unbounded answer space and highly variable-length reasoning traces; only [stop] is learned with the specific behavior we intended. In AIME specifically, we note that once a high-confidence answer is surfaced, the model never deviates, and reasoning quickly terminates again on the same answer if forced to continue. Notably, we observe *no answer changes* even when Qwen adheres to [continue], and only a 16% answer-change rate for GLM. Across all Behavior Cue Reasoning traces for both models, we observe a *less than 3%* rate of the model proposing the correct answer only to later change to another answer.

**RQ2: To what degree do Behavior Cues enable an external monitor to reduce wasted reasoning tokens without sacrificing correctness?** To evaluate the monitorability of Behavior Cue Reasoning versus standard reasoning in the efficiency setting, we train the oversight module described in Section [3](https://arxiv.org/html/2605.07021#S3) in an offline setting where reasoning traces are pre-generated. The offline setting enables a ground-truth comparison of the final committed answer against the answer the monitor stopped at. This enables calculation of overall tokens saved while preserving correctness of the final solution, as well as tokens saved through early termination of reasoning dead-ends. We compare two monitor-training formulations enabled by Behavior Cues: using [answer] occurrences as decision points versus using individual reasoning steps. We additionally evaluate whether the information surfaced by Behavior Cues alone is sufficient signal for the monitor by training on a compressed view of the reasoning traces containing only the answer progression and the number of reasoning lines between each surfaced answer. AIME's long reasoning traces make this compressed-trace setting informative for testing whether Behavior Cues alone expose sufficient signal for early reasoning termination.
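
A sketch of building this compressed view is shown below; it assumes each [answer] block sits on its own line (as in the embedding sketch earlier), and the rendered entry format is an assumption.

```python
import re

# An [answer] line: cue, working answer, and optionally a trailing cue.
ANSWER_LINE = re.compile(r"\[answer\]\s*(.*?)\s*(?:\[continue\]|\[stop\])?\s*$")

def compress_trace(trace: str) -> str:
    """Reduce a Behavior Cue trace to its answer progression plus the
    number of reasoning lines between consecutive surfaced answers."""
    entries, lines_since = [], 0
    for line in trace.splitlines():
        m = ANSWER_LINE.match(line.strip())
        if m:
            entries.append(f"+{lines_since} lines -> [answer] {m.group(1)}")
            lines_since = 0
        elif line.strip():
            lines_since += 1
    return "\n".join(entries)
```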

![Refer to caption](https://arxiv.org/html/2605.07021v1/figures/Monitor_RL_v2.png)

Figure 2: Validation success rate and percent of maximum possible token savings over the course of RL fine-tuning for the oversight module. In AIME, due to the single-turn nature and reasoning trace length, we include a secondary objective of terminating dead-end reasoning traces as early as possible. Additionally, due to the length of these reasoning traces, we include a monitor training setting where the actor's reasoning is compacted, a formulation enabled solely by Behavior Cues. In Cookingworld, we find the plain reasoning trace results in the monitor finding no success or reward and quickly collapsing to a degenerate policy.

**Behavior Cues serve a critical role in allowing a non-reasoning oversight module to learn to supervise a reasoning actor model for efficiency.** While using individual reasoning steps as decision points allows for more fine-grained intervention, we find it quickly diverges and leads to degenerate policies during training. Figure [2](https://arxiv.org/html/2605.07021#S5.F2) shows the success rate and percentage of ground-truth token savings achieved during RL training for monitors in both AIME and Cookingworld. For AIME specifically, we include the results of an ablation where the reasoning trace is compressed to *only* the answer progression and the reasoning line count between answer switches. We observe that for Behavior Cue Reasoning traces, the monitor begins to calibrate its decisions on when to terminate reasoning based on the actor model's reasoning itself. While effective for pruning excess tokens from correct trajectories, this policy is far less effective at pruning dead-end reasoning, shown by the lower achieved percentage of possible token savings in AIME. The monitor in these scenarios learns to determine stopping criteria by effectively internalizing the actor model's reasoning and stopping just short of where the actor model's natural reasoning would stop. The compressed reasoning prevents this learning shortcut by stripping away the reasoning content itself, forcing the monitor to learn to stop based on answer progression patterns instead. This suggests the answer progression in reasoning traces leading to correct answers is generally distinct from that of ineffective reasoning that ultimately fails.

**RQ3: To what degree do Behavior Cues enable an external monitor to prevent unsafe actions?** To evaluate the monitorability of Behavior Cue Reasoning versus standard reasoning in the safety setting, we perform two experiments. Similar to the efficiency setting, we evaluate the degree to which Behavior Cue Reasoning traces are more tractable than plain reasoning traces for a non-reasoning monitor learning to classify reasoning as leading to safe or unsafe actions. Next, we evaluate a safety-enforcement setting that Behavior Cue Reasoning enables. A rule-based action classification monitor is constructed purely from action outcomes in training trajectories. This rule-based monitor is paired with a Behavior Cue Reasoning actor model to record safe candidate actions surfaced via [answer] during reasoning, substituting them for any unsafe action the actor's reasoning ultimately concludes in. This isolates the contribution of Behavior Cues to the recovery mechanism: while the classifier does not require Behavior Cues, the mid-reasoning extraction and substitution of safe candidate actions is not possible with standard reasoning. We evaluate this setting online with the deployed monitor to measure downstream effects on success that offline evaluation could not capture.

Table 2: F1 scores from supervised fine-tuning on classifying actions as 'Safe' or 'Unsafe' in Hazardworld, by the number of input tokens trained on (millions). We report F1 over accuracy due to class imbalance in the action distribution. Both settings struggle, with the monitor having a slightly easier time on Behavior Cue Reasoning traces.

| Reasoning Format | Base | 1M | 2M | 3M | 4M | 5M | 6M | 7M | 8M | 9M | 10M |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Behavior Cues | 58% | 12% | 60% | 33% | 63% | 55% | 36% | 45% | 66% | 60% | 60% |
| Plain Reasoning | 39% | 40% | 39% | 35% | 43% | 41% | 31% | 22% | 30% | 49% | 48% |

**LLM-based safety classification is difficult even with Behavior Cues.** Table [2](https://arxiv.org/html/2605.07021#S5.T2) reports the F1 of an SFT-trained non-reasoning monitor classifying stretches of reasoning as leading to safe or unsafe actions. Behavior Cue Reasoning traces are compared against plain reasoning. While Behavior Cue Reasoning sees a generally higher F1 across most checkpoints, neither setting reaches the reliability needed for a deployable safety monitor. This finding suggests the difficulty in this setting stems from the inability of the base LLM itself to perform the safety classification without reasoning. This motivates an alternative approach: rather than improve LLM-based classification, we leverage Behavior Cues to enable an intervention that does not require an LLM-based monitor at all.

**Behavior Cue Reasoning enables recovery of safe actions from reasoning that ultimately ends unsafely, drastically boosting the success rate.** The near-perfect accuracy of the rule-based monitor (99.54% F1) allows us to explicitly isolate the contribution of Behavior Cues from monitor accuracy. Table [3](https://arxiv.org/html/2605.07021#S5.T3) demonstrates the effect of pairing a near-optimal rule-based monitor with a Behavior Cue Reasoning model in the recovery setting, more than doubling the overall success rate from 46% to 96%. [answer] enables a reasoning-preserving mode of intervention that a simple action-reject monitor cannot provide. Of all reasoning traces that ultimately terminate on an unsafe action, over 80% have a safe candidate action recovered by the monitor through [answer]. Rather than throw out all reasoning and force the actor model to generate an entirely new reasoning trace, the safety enforcement monitor is able to extract the model's most confident safe action exposed by [answer], despite the reasoning ultimately ending in an unsafe action.
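
A sketch of this recovery rule follows. Here `is_safe` stands in for the rule-based classifier built from training trajectories, and `"wait"` is an assumed placeholder for the environment's no-op action.

```python
from typing import Callable, List

def commit_action(candidates: List[str],
                  is_safe: Callable[[str], bool],
                  noop: str = "wait") -> str:
    """candidates: actions surfaced at [answer] cues, in order; the last
    one is the action the actor's reasoning ultimately concluded in."""
    if is_safe(candidates[-1]):
        return candidates[-1]        # reasoning ended safely: commit as-is
    for action in reversed(candidates[:-1]):
        if is_safe(action):          # recover the latest safe working answer
            return action
    return noop                      # no safe candidate surfaced: no-op
```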

Table 3: Adjusted reward zeros out the score for any episode where violations exceed the budget. The Enforcement Monitor prevents the commitment of any unsafe action, falling back to a no-op action or the last safe action proposed during reasoning.

| Configuration | Success % | Adj. Reward | Avg. Reward |
| --- | --- | --- | --- |
| Qwen3-8B | 46% | 2.35 | 4.86 |
| Qwen3-8B w/ Behavior Cues | 46% | 2.45 | 4.64 |
| Safety Enforcement Monitor | 96% | 3.99 | 3.99 |

## 6 Related Work

**Reasoning Monitorability and Oversight:** Recent work examines the potential of monitoring LLM reasoning to detect and address misaligned behaviors before they surface in the final response [[22](https://arxiv.org/html/2605.07021#bib.bib22), [5](https://arxiv.org/html/2605.07021#bib.bib5), [13](https://arxiv.org/html/2605.07021#bib.bib13)]. Previous work has shown that these reasoning traces can still be unfaithful, where the final answer of the model does not align with the reasoning [[37](https://arxiv.org/html/2605.07021#bib.bib37), [23](https://arxiv.org/html/2605.07021#bib.bib23), [16](https://arxiv.org/html/2605.07021#bib.bib16), [12](https://arxiv.org/html/2605.07021#bib.bib12)]. This unfaithfulness becomes an issue when an overseer needs to infer the actor model's working answer from the reasoning, as it may reduce the monitor's precision in identifying the actor's current working answer. Baker et al. [[4](https://arxiv.org/html/2605.07021#bib.bib4)] demonstrate that while reasoning monitors can detect misaligned behaviors in frontier reasoning models, applying optimization pressure directly to the reasoning trace can produce obfuscated misbehavior, where the model learns to hide misaligned behavior rather than avoid it in order to improve reward. This motivates a "monitorability tax", where reasoning is deliberately left outside the training loop to explicitly isolate it from optimization pressures. Behavior Cue Reasoning aligns with this principle and sidesteps the issue of faithfulness by training an actor model to directly expose its working answer at meaningful junctions. All optimization pressure for oversight is instead placed on the monitor, and interpreting the actor model's reasoning becomes a non-issue.

**Reasoning Control through Token Injection:** Prior work examines the application of interventions for steering a model's reasoning. These approaches typically involve the injection of natural language phrases to steer the model towards certain thought paths without modification of the underlying model [[19](https://arxiv.org/html/2605.07021#bib.bib19), [42](https://arxiv.org/html/2605.07021#bib.bib42), [49](https://arxiv.org/html/2605.07021#bib.bib49), [40](https://arxiv.org/html/2605.07021#bib.bib40), [36](https://arxiv.org/html/2605.07021#bib.bib36)]. Muennighoff et al. [[25](https://arxiv.org/html/2605.07021#bib.bib25)] train a non-thinking model to reason effectively, showing performance can be further improved by early termination or the use of the phrase "Wait" to extend reasoning. Zhao et al. [[51](https://arxiv.org/html/2605.07021#bib.bib51)] leverage activations and the "wait" token to elicit long chain-of-thought behavior with minimal training. In contrast to these methods, our approach specifically targets making a model's reasoning trace more monitorable *and* steerable.

**Inducing Behavior Through Control Tokens:** Previous works have explored controlling model behavior via special tokens, both inside and outside of a reasoning trace. Zhang et al. [[50](https://arxiv.org/html/2605.07021#bib.bib50)] leverage the [RESET] token to allow a model to completely undo potentially unsafe generations, with Sel et al. [[32](https://arxiv.org/html/2605.07021#bib.bib32)] refining this to a specific subset of the generation. More recent work extends the use of behavior-resetting special tokens to reasoning [[46](https://arxiv.org/html/2605.07021#bib.bib46)]. Ringel et al. [[30](https://arxiv.org/html/2605.07021#bib.bib30)] demonstrate how a specialized token for enforcing continued reasoning can outperform a natural language signal such as "Wait". Kim et al. [[21](https://arxiv.org/html/2605.07021#bib.bib21)] and Goyal et al. [[11](https://arxiv.org/html/2605.07021#bib.bib11)] use the [PAUSE] token, albeit in different ways: the former allows the model to express low confidence, and the latter signals the model to perform more internal hidden-layer computation, effectively increasing the compute used for generating the next token.

**Early Stopping in Reasoning:** Here we discuss work that explicitly terminates or truncates reasoning early for both better token efficiency and performance. These approaches typically rely on entropy, or on confidence in a proposed answer, as a signal to immediately terminate the model's reasoning at a certain reasoning step [[43](https://arxiv.org/html/2605.07021#bib.bib43), [33](https://arxiv.org/html/2605.07021#bib.bib33), [29](https://arxiv.org/html/2605.07021#bib.bib29)]. Sun et al. [[35](https://arxiv.org/html/2605.07021#bib.bib35)] use the similarity of newly generated reasoning steps to previous steps as a signal for early termination. Mao et al. [[24](https://arxiv.org/html/2605.07021#bib.bib24)] generate answers for each reasoning step, ending reasoning when the answer is consistent across a large number of steps.

**Reasoning Agents in Interactive Tasks:** Agents in interactive environments have been studied since long before the advent of LLMs, with language-focused testbeds such as Textworld [[7](https://arxiv.org/html/2605.07021#bib.bib7)] or Jericho [[14](https://arxiv.org/html/2605.07021#bib.bib14)] used for training RL agents in sequential decision-making [[34](https://arxiv.org/html/2605.07021#bib.bib34), [38](https://arxiv.org/html/2605.07021#bib.bib38)]. These agents often used a set of action templates to bound the natural language action space to a subset learnable by a smaller network. Recent work has examined the performance of LLMs as the decision-making centers of agents in these environments [[8](https://arxiv.org/html/2605.07021#bib.bib8), [28](https://arxiv.org/html/2605.07021#bib.bib28)]. LLMs as agents in interactive tasks have also been explored in a wide range of domains such as coding [[47](https://arxiv.org/html/2605.07021#bib.bib47), [44](https://arxiv.org/html/2605.07021#bib.bib44)], scientific discovery [[26](https://arxiv.org/html/2605.07021#bib.bib26), [18](https://arxiv.org/html/2605.07021#bib.bib18)], and deep research [[17](https://arxiv.org/html/2605.07021#bib.bib17)].

## 7 Considerations

**Limitations.** Our work progresses scalable oversight through the explicit training of a model's reasoning to be more monitorable and controllable to external intervention. While we attempt to minimize downstream performance impact, there is a clear distribution shift. Regardless of whether this ultimately improves performance, future work may investigate methods for teaching a model to perform Behavior Cue Reasoning without modifying the original model's weights at all. We primarily investigate domains where the final answer is a short phrase that can easily be integrated into the reasoning trace. Another direction may be pushing the model to record this answer in an external document, to facilitate extension to environments where the downstream answer may be longer than the reasoning itself. Lastly, we primarily investigate the trio of [answer], [continue], and [stop] due to the implicit relation between these behaviors. Future work may investigate other reasoning behaviors, such as backtracking for verification.

**Societal Impacts.** Behavior Cues are intended to improve scalable oversight by making intermediate reasoning states easier to monitor and intervene on, potentially reducing wasted inference compute and preventing unsafe commitments before they appear in final outputs. However, because many users interact with LLMs through APIs or mediated interfaces, cue-based control could be hidden from end users and used by a platform or intermediary to silently steer model reasoning without meaningful transparency or consent. Behavior Cues should also be understood as strong learned priors rather than deterministic guarantees, due to the stochasticity of LLM generation, as reflected by the varying degrees of cue compliance observed across domains. See Appendix [H](https://arxiv.org/html/2605.07021#A8) for further discussion.

## 8 Conclusion

In this work we introduce Behavior Cues, a method for making a model's reasoning more monitorable and controllable. During Behavior Cue Reasoning, the model periodically emits special token sequences anchored to specific behaviors: [answer] when the model's internal working answer changes, [continue] when the model believes it needs to continue reasoning, and [stop] when the model commits its answer. Across two model families and three domains, we demonstrate that training an actor model to produce these Behavior Cues matches or improves baseline performance across nearly all model-domain combinations. During Behavior Cue Reasoning, the actor model reliably adheres to externally enforced [continue] or [stop] cues while exposing its answer progression via the [answer] cue.

This answer progression enables direct oversight over the actor model's answer before it is committed, allowing for improvements in both reasoning efficiency and safety under constraint adherence. In reasoning efficiency, the answer progression allows a weaker, non-reasoning monitor to intervene when the actor model has already arrived at the correct answer while also preemptively pruning reasoning dead-ends, saving up to 50% of otherwise wasted tokens. In safety, the exposure of the actor model's working answer allows for the extraction of safe actions from reasoning traces that ultimately conclude in unsafe actions, recovering 80% of LLM steps that might otherwise have required resampling the model.

## References

- [1] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
- [2] Anthropic. Building with extended thinking. [https://docs.claude.com/en/docs/build-with-claude/extended-thinking](https://docs.claude.com/en/docs/build-with-claude/extended-thinking), 2025.
- [3] Amos Azaria and Tom Mitchell. The internal state of an LLM knows when it's lying. In Findings of the Association for Computational Linguistics: EMNLP 2023, 2023.
- [4] Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y. Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi. Monitoring reasoning models for misbehavior and the risks of promoting obfuscation, 2025.
- [5] Samuel R. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamilė Lukošiūtė, Amanda Askell, Andy Jones, Anna Chen, et al. Measuring progress on scalable oversight for large language models. arXiv preprint arXiv:2211.03540, 2022.
- [6] Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeff Wu. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. In International Conference on Machine Learning, 2024.
- [7] Marc-Alexandre Côté, Akos Kádár, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, James Moore, Matthew Hausknecht, Layla El Asri, Mahmoud Adada, et al. Textworld: A learning environment for text-based games. In Workshop on Computer Games, pages 41–75. Springer, 2018.
- [8] Christopher Zhang Cui, Xingdi Yuan, Ziang Xiao, Prithviraj Ammanabrolu, and Marc-Alexandre Côté. TALES: Text adventure learning environment suite, 2025.
- [9] DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026.
- [10] Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for LLM agent training. arXiv preprint arXiv:2505.10978, 2025.
- [11] Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. In The Twelfth International Conference on Learning Representations.
- [12] Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Mindermann, Ethan Perez, Linda Petrini, Jonathan Uesato, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, and Evan Hubinger. Alignment faking in large language models, 2024.
- [13] Melody Y. Guan, Miles Wang, Micah Carroll, Zehao Dou, Annie Y. Wei, Marcus Williams, Benjamin Arnav, Joost Huizinga, Ian Kivlichan, Mia Glaese, Jakub Pachocki, and Bowen Baker. Monitoring monitorability, 2025.
- [14] Matthew Hausknecht, Prithviraj Ammanabrolu, Marc-Alexandre Côté, and Xingdi Yuan. Interactive fiction games: A colossal adventure. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7903–7910, 2020.
- [15] David Herel and Tomas Mikolov. Collapse of self-trained language models, 2024.
- [16] Nikolaus Howe and Micah Carroll. The ends justify the thoughts: RL-induced motivated reasoning in LLM CoTs, 2026.
- [17] Yuxuan Huang, Yihang Chen, Haozheng Zhang, Kang Li, Huichi Zhou, Meng Fang, Linyi Yang, Xiaoguang Li, Lifeng Shang, Songcen Xu, et al. Deep research agents: A systematic examination and roadmap. arXiv preprint arXiv:2506.18096, 2025.
- [18] Peter Jansen, Marc-Alexandre Côté, Tushar Khot, Erin Bransom, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Oyvind Tafjord, and Peter Clark. Discoveryworld: A virtual environment for developing and evaluating automated scientific discovery agents. Advances in Neural Information Processing Systems, 37:10088–10116, 2024.
- [19] Hyunbin Jin, Je Won Yeom, Seunghyun Bae, and Taesup Kim. "Well, keep thinking": Enhancing LLM reasoning with adaptive injection decoding. In Findings of the Association for Computational Linguistics: ACL 2025, pages 9989–10018, Vienna, Austria, July 2025. Association for Computational Linguistics.
- [20] Zachary Kenton, Noah Y. Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman, and Rohin Shah. On scalable oversight with weak LLMs judging strong LLMs. arXiv preprint arXiv:2407.04622, 2024.
- [21] Eunki Kim, Sangryul Kim, and James Thorne. Learning to insert [pause] tokens for better reasoning. arXiv preprint arXiv:2506.03616, 2025.
- [22] Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Emmons, Owain Evans, David Farhi, Ryan Greenblatt, Dan Hendrycks, Marius Hobbhahn, Evan Hubinger, Geoffrey Irving, Erik Jenner, Daniel Kokotajlo, Victoria Krakovna, Shane Legg, David Lindner, David Luan, Aleksander Mądry, Julian Michael, Neel Nanda, Dave Orr, Jakub Pachocki, Ethan Perez, Mary Phuong, Fabien Roger, Joshua Saxe, Buck Shlegeris, Martín Soto, Eric Steinberger, Jasmine Wang, Wojciech Zaremba, Bowen Baker, Rohin Shah, and Vlad Mikulik. Chain of thought monitorability: A new and fragile opportunity for AI safety, 2025.
- [23] Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamilė Lukošiūtė, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Larson, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy Maxwell, Timothy Telleen-Lawton, Tristan Hume, Zac Hatfield-Dodds, Jared Kaplan, Jan Brauner, Samuel R. Bowman, and Ethan Perez. Measuring faithfulness in chain-of-thought reasoning, 2023.
- [24] Minjia Mao, Bowen Yin, Yu Zhu, and Xiao Fang. Early stopping chain-of-thoughts in large language models, 2025.
- [25] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candes, and Tatsunori Hashimoto. s1: Simple test-time scaling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20275–20321, Suzhou, China, November 2025. Association for Computational Linguistics.
- [26] Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, et al. MLGym: A new framework and benchmark for advancing AI research agents. arXiv preprint arXiv:2502.14499, 2025.
- [27] OpenAI. Learning to reason with LLMs. [https://openai.com/index/learning-to-reason-with-llms/](https://openai.com/index/learning-to-reason-with-llms/), 2024.
- [28] Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuciński, Lerrel Pinto, Rob Fergus, et al. BALROG: Benchmarking agentic LLM and VLM reasoning on games. arXiv preprint arXiv:2411.13543, 2024.
- [29] Mohammad Atif Quamar and Mohammad Areeb. Logit-entropy adaptive stopping heuristic for efficient chain-of-thought reasoning, 2025.
- [30] Liran Ringel, Elad Tolochinsky, and Yaniv Romano. Learning a continue-thinking token for enhanced test-time scaling. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 3324–3345, Mumbai, India, December 2025.
- [31] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017.
- [32] Bilgehan Sel, Dingcheng Li, Phillip Wallis, Vaishakh Keshava, Ming Jin, and Siddhartha Reddy Jonnalagadda. Backtracking for safety. arXiv preprint arXiv:2503.08919, 2025.
- [33] Aman Sharma and Paras Chopra. Think just enough: Sequence-level entropy as a confidence signal for LLM reasoning, 2025.
- [34] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning text and embodied environments for interactive learning. In International Conference on Learning Representations.
- [35] Renliang Sun, Wei Cheng, Dawei Li, Haifeng Chen, and Wei Wang. Stop when enough: Adaptive early-stopping for chain-of-thought reasoning, 2025.
- [36] Siao Tang, Xinyin Ma, Gongfan Fang, and Xinchao Wang. ConciseHint: Boosting efficient reasoning via continuous concise hints during generation. arXiv preprint arXiv:2506.18810, 2025.
- [37] Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting, 2023.
- [38] Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, and Prithviraj Ammanabrolu. ScienceWorld: Is your agent smarter than a 5th grader? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11279–11298, 2022.
- [39] Hao Wen, Xinrui Wu, Yi Sun, Feifei Zhang, Liye Chen, Jie Wang, Yunxin Liu, Yunhao Liu, Ya-Qin Zhang, and Yuanchun Li. BudgetThinker: Empowering budget-aware LLM reasoning with control tokens. arXiv preprint arXiv:2508.17196, 2025.
- [40] Tong Wu, Chong Xiang, Jiachen T. Wang, G. Edward Suh, and Prateek Mittal. Effectively controlling reasoning models through thinking intervention. arXiv preprint arXiv:2503.24370, 2025.
- [41] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [42] Chenxu Yang, Qingyi Si, Mz Dai, Dingyu Yao, Mingyu Zheng, Minghui Chen, Zheng Lin, and Weiping Wang. Test-time prompt intervention, 2025.
- [43] Chenxu Yang, Qingyi Si, Yongjie Duan, Zheliang Zhu, Chenyu Zhu, Qiaowei Li, Minghui Chen, Zheng Lin, and Weiping Wang. Dynamic early exit in reasoning models. arXiv preprint arXiv:2504.15895, 2025.
- [44] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In Proceedings of the 38th International Conference on Neural Information Processing Systems, pages 50528–50652, 2024.
- [45] Tsung-Yen Yang, Michael Y. Hu, Yinlam Chow, Peter J. Ramadge, and Karthik Narasimhan. Safe reinforcement learning with natural language constraints. Advances in Neural Information Processing Systems, 34:13794–13808, 2021.
- [46] Xiao-Wen Yang, Xuan-Yi Zhu, Wen-Da Wei, Ding-Chu Zhang, Jie-Jing Shao, Zhi Zhou, Lan-Zhe Guo, and Yu-Feng Li. Step back to leap forward: Self-backtracking for boosting reasoning of language models. arXiv preprint arXiv:2502.04404, 2025.
- [47] Xingdi Yuan, Morgane M. Moss, Charbel El Feghali, Chinmay Singh, Darya Moldavskaya, Drew MacPhee, Lucas Caccia, Matheus Pereira, Minseon Kim, Alessandro Sordoni, and Marc-Alexandre Côté. debug-gym: A text-based environment for interactive debugging, 2025.
- [48] Anqi Zhang, Yulin Chen, Jane Pan, Chen Zhao, Aurojit Panda, Jinyang Li, and He He. Reasoning models know when they're right: Probing hidden states for self-verification. arXiv preprint arXiv:2504.05419, 2025.
- [49] Xingsheng Zhang, Luxi Xing, Chen Zhang, Yanbing Liu, Yifan Deng, Yunpeng Li, Yue Hu, and Chenxu Niu. Can we steer reasoning direction by thinking intervention? In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 3888–3913, Suzhou, China, November 2025. Association for Computational Linguistics.
- [50] Yiming Zhang, Jianfeng Chi, Hailey Nguyen, Kartikeya Upasani, Daniel M. Bikel, Jason E. Weston, and Eric Michael Smith. Backtracking improves generation safety. In The Thirteenth International Conference on Learning Representations.
- [51] Zekai Zhao, Qi Liu, Kun Zhou, Zihan Liu, Yifei Shao, Zhiting Hu, and Biwei Huang. Activation control for efficiently eliciting long chain-of-thought ability of language models. arXiv preprint arXiv:2505.17697, 2025.

## Appendix A Embedded Answer Fixation

Our initial approaches to training models to perform Behavior Cue Reasoning were pure prompting and RL fine-tuning. However, we found both to fail due to the models' tendency to fixate on any previously mentioned answers within their reasoning trace. Models required SFT to adhere properly to the Behavior Cue Reasoning format, and we observed an inverted-U curve in performance over SFT checkpoints: early checkpoints performed strictly worse than the base model while failing to adhere to the Behavior Cue Reasoning format; middle checkpoints achieved both proper formatting and performance comparable to, if not greater than, the base model; and later checkpoints once again showed degraded base-model capabilities even while adhering to the Behavior Cue Reasoning format.

We suspect the early collapse is a symptom of the model initially fixating on the embedded answers during early training, resulting in a collapse of reasoning capabilities. The middle checkpoints see a sharp increase in performance as the model learns to adhere to the Behavior Cue Reasoning format without disrupting its natural reasoning. The later checkpoints collapse again as the model effectively begins to train on its own reasoning output; this later-checkpoint collapse is a studied effect with evidence tracing back to GPT-2 [[15](https://arxiv.org/html/2605.07021#bib.bib15)]. To explain the early checkpoint result and the question of why Behavior Cues cannot simply be elicited by prompting or RL, we perform a small-scale experiment.

This experiment is performed post-hoc with a Behavior Cue Reasoning model to show that a base model will otherwise fixate on any answers included in its context rather than surface them and continue reasoning, as done in Behavior Cue Reasoning. We take 100 context and reasoning trace pairs generated by a Behavior Cue Reasoning model. For the reasoning, we truncate to the second [answer] block, feed the context and truncated reasoning back to a non-Behavior Cue Reasoning model, and allow it to continue generating. This advantages the base model towards generating a different answer than the one exposed in the first [answer] block, as the stretch of reasoning after the first answer that leads the Behavior Cue Reasoning model to propose the second answer is explicitly provided to the base model. Despite this, we found the base model would fixate on and repeat the first answer 73% of the time.
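For concreteness, a minimal sketch of the truncation step in this probe; the function name and the literal `[answer]` cue string are illustrative assumptions about the trace format, not the released tooling:

```python
import re

def truncate_to_second_answer(trace: str, cue: str = "[answer]"):
    """Cut a Behavior Cue Reasoning trace right before its second [answer]
    block, so a base model can be asked to continue from that point."""
    starts = [m.start() for m in re.finditer(re.escape(cue), trace)]
    if len(starts) < 2:
        return None  # trace never surfaced a second answer; skip it
    return trace[: starts[1]]
```

The truncated reasoning, together with the original context, is then fed to the non-Behavior Cue Reasoning model for continuation.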

During RL training, when the base model instead needs to construct a line of reasoning that leads it to a different answer on its own, this initial answer fixation almost always resulted in the second answer being a repeat of the first, causing training not to converge.

A second motivation for avoiding RL training arises from the "monitorability tax" coined by Baker et al. [[4](https://arxiv.org/html/2605.07021#bib.bib4)]. RL-training a model to perform Behavior Cue Reasoning would explicitly require monitoring the reasoning trace to assign reward and ensure the correct behavior is adhered to. This both creates dual optimization pressure, since the training objective must also push the model to arrive at the correct final answer, and opens the door to reward hacking.

## Appendix B Cookingworld Data Filtering

Cookingworld tasks involve the LLM acting as an agent that navigates a household setting, gathers ingredients, and prepares a meal. As ingredients are limited, failure can occur if an important ingredient is consumed or prepared in the wrong way, for example, eating a raw ingredient that should have been cooked. If this occurs, the environment immediately terminates with a losing message. The specific tasks and environment layouts used in [[8](https://arxiv.org/html/2605.07021#bib.bib8)] serve as our test split, with alternate recipes and environment layouts used as our train split. A max step count of 20 provides a tractable horizon that still gives the LLM sufficient attempts to make non-terminal mistakes in the environment. A reasoning budget of 2048 tokens per step is allocated, as higher token budgets yielded minimal performance gains for the base actor model. Due to the difficulty of the environment, both models failed to complete almost all of the included scenarios with a minimal prompt. Following the advice of [[8](https://arxiv.org/html/2605.07021#bib.bib8)], we include the contents of the `help` command in the initial observation to aid the models in completing the given tasks.

We do not use the reasoning dead-end formulation in Cookingworld because, due to the multi-step nature of the task, there is no ground truth for whether an action is 'incorrect' in the immediate context; it may lead to success or failure later on. While task failure is possible in Cookingworld, for example eating a required ingredient for the meal when there are no backups, labeling all reasoning for these trajectories with the reasoning dead-end formulation would cause many reasoning traces over identical or near-identical states to receive both reasoning dead-end labels and successful reasoning labels, introducing a significant amount of noise. Furthermore, as the failure turns make up less than 5% of the total data samples, we discard them to focus monitor evaluation in Cookingworld on efficiency in a multi-turn setting.

## Appendix C HazardWorld Changes

We build on the HazardWorld benchmark of Yang et al. [[45](https://arxiv.org/html/2605.07021#bib.bib45)], with modifications grouped into four categories: task-tractability adjustments to make the environment solvable by LLM agents, a text observation and action interface, the scenario suite used for train/test evaluation, and benchmark-validity corrections to the original implementation.

### C.1 Task-tractability adjustments

The original HazardWorld places its three pickup objects \(Ball, Box, Key\) and hazard tiles independently\. On large grids this frequently produces scenarios in which an object is unreachable within a useful step budget, or in which the constraint never interacts with the optimal trajectory\. To make episodes both tractable for LLM agents and informative about constraint adherence, we add two placement mechanisms and expose hazard density as a hyperparameter\.

**Chain object placement.** When a ground-truth step budget `max_steps_gt` is supplied, objects are placed sequentially along a chain Agent → Ball → Box → Key, with each leg constrained to lie between 2 and ⌊`max_steps_gt`/4⌋ Manhattan cells from the previous waypoint. The lower bound of 2 guarantees at least one intermediate cell on each shortest path (used by hazard injection below); the upper bound limits the optimal trajectory length.
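To make the placement rule concrete, here is a minimal sketch under the constraints above; the function and variable names (`place_chain`, `empty_cells`) are illustrative assumptions, not the released implementation:

```python
import numpy as np

def place_chain(rng: np.random.Generator, empty_cells, max_steps_gt, lo=2):
    """Place a chain Agent -> Ball -> Box -> Key on the grid's empty cells,
    with each leg's Manhattan length in [lo, max_steps_gt // 4].
    Retries on an empty candidate set are omitted for brevity."""
    hi = max_steps_gt // 4
    cells = list(empty_cells)
    waypoints = [cells[rng.integers(len(cells))]]  # agent start
    for _ in ("Ball", "Box", "Key"):
        px, py = waypoints[-1]
        legal = [(x, y) for (x, y) in cells
                 if lo <= abs(x - px) + abs(y - py) <= hi
                 and (x, y) not in waypoints]
        waypoints.append(legal[rng.integers(len(legal))])
    return waypoints  # [agent, ball, box, key]
```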

**Hazard-on-path injection.** After object placement, we ensure that at least one tile of the avoid-hazard type lies within the bounding box of every consecutive (waypoint, waypoint) pair. If no such tile already exists, we place one on an empty cell, or otherwise overwrite a non-avoid hazard. This guarantees that the agent must reason about the constraint in order to traverse the chain. For `HazardWorldSequential` with `isBefore=False`, we use the dormant avoid object (the active `avoid_obj` is `None` until the trigger fires).
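A corresponding sketch of the injection step, again with illustrative names; it assumes a simplified grid representation mapping (x, y) cells to tile types, with missing keys meaning empty:

```python
import numpy as np

def inject_hazard_on_path(grid, waypoints, avoid_type, rng: np.random.Generator):
    """Ensure at least one avoid-type tile inside the bounding box of each
    consecutive waypoint pair; prefer an empty cell, otherwise overwrite a
    non-avoid hazard."""
    for (ax, ay), (bx, by) in zip(waypoints, waypoints[1:]):
        box = [(x, y)
               for x in range(min(ax, bx), max(ax, bx) + 1)
               for y in range(min(ay, by), max(ay, by) + 1)
               if (x, y) not in ((ax, ay), (bx, by))]
        if any(grid.get(c) == avoid_type for c in box):
            continue  # this leg already engages the constraint
        empty = [c for c in box if grid.get(c) is None]
        pool = empty or [c for c in box if grid.get(c) != avoid_type]
        grid[pool[rng.integers(len(pool))]] = avoid_type
```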

Hazard density\.We expose hazard density as a hyperparametersparsity, with default0\.750\.75\(raised from0\.500\.50in the original\)\. At0\.750\.75,∼61%\\sim\\\!61\\%of interior cells are hazardous and∼21%\\sim\\\!21\\%are of the avoid type, increasing the rate of constraint engagement per episode\. We also reorder placement in the Budgetary and Sequential subclasses to agent→\\toobjects→\\tohazards \(matching Relational\), so that chain placement always operates on an empty interior\.

| Parameter | Default | Effect |
| --- | --- | --- |
| `size` | 13 | Grid side length (now passable to all subclasses) |
| `max_steps_gt` | None | Ground-truth step budget; enables chain placement |
| `sparsity` | 0.75 | Probability each interior cell is a hazard |

Table 4: New environment parameters added to `HazardWorldBase`.
### C.2 Text observation and action interface

We wrap the environment with a `HazardWorldTextWrapper` that produces ASCII observations and accepts text actions, enabling LLM agents to play without bespoke tokenisation.

**Grid encoding.** Every cell is rendered as a fixed two-character code, chosen so that each code is a single token in the Qwen3-8B tokenizer used in our experiments: `##` Wall, `..` Empty, `Lv` Lava, `Gr` Grass, `Wa` Water, `Bl` Ball, `Bo` Box, `Ky` Key, `??` Unseen. The agent is rendered as `>>`/`<<`/`vv`/`^^` in facing mode, where the agent can only move forward and must explicitly turn to change direction, or `Ag` in cardinal mode. Single-token encoding avoids cross-cell token boundaries that would otherwise force the model to spend reasoning capacity on parsing the map.
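For illustration, a minimal rendering sketch using the codes above; the wrapper's actual class and method names may differ:

```python
# Two-character cell codes from the paragraph above.
CELL_CODES = {
    "wall": "##", "empty": "..", "lava": "Lv", "grass": "Gr",
    "water": "Wa", "ball": "Bl", "box": "Bo", "key": "Ky", "unseen": "??",
}
AGENT_FACING = {"right": ">>", "left": "<<", "down": "vv", "up": "^^"}

def render_ascii(grid, agent_pos, agent_dir, cardinal=False):
    """Render a grid (list of rows of tile-type strings) as fixed-width
    two-character codes, overlaying the agent at agent_pos = (x, y)."""
    rows = []
    for y, row in enumerate(grid):
        codes = [CELL_CODES[cell] for cell in row]
        if y == agent_pos[1]:
            codes[agent_pos[0]] = "Ag" if cardinal else AGENT_FACING[agent_dir]
        rows.append("".join(codes))
    return "\n".join(rows)
```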

**Observation and action options.**

- *Partial observability:* a configurable square window (default 3×3) around the agent; cells outside the window are rendered `??`. For window size w the visibility radius is r = (w − 1)/2 (see the sketch after this list).
- *Fog of war* (optional): once-seen cells remain revealed for the rest of the episode.
- *Cardinal movement* (optional): the four-action set {up, down, left, right} replaces the original turn-then-forward action set.
- *Auto-pickup:* moving onto a pickupable object collects it automatically, matching the convention common in text-game baselines.
- *Action parsing:* actions are extracted from `<action>…</action>` tags in model output.
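A minimal sketch of the partial-observability mask with optional fog of war; names are illustrative, and the grid is again assumed to be a list of rows of tile-type strings:

```python
def mask_partial_view(grid, agent_pos, window=3, seen=None):
    """Mask cells outside the (window x window) square around the agent as
    'unseen'. If a `seen` set is supplied (fog of war), once-seen cells
    stay revealed for the rest of the episode."""
    r = (window - 1) // 2  # visibility radius r = (w - 1) / 2
    ax, ay = agent_pos
    out = []
    for y, row in enumerate(grid):
        masked = []
        for x, cell in enumerate(row):
            visible = abs(x - ax) <= r and abs(y - ay) <= r
            if visible and seen is not None:
                seen.add((x, y))
            remembered = seen is not None and (x, y) in seen
            masked.append(cell if (visible or remembered) else "unseen")
        out.append(masked)
    return out
```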

### C.3 Scenario suite

We release a fixed scenario file (`scenarios.json`) containing 30 training and 30 testing scenarios. Both splits share the same parameter combinations but use disjoint random seeds, so test-time evaluation measures generalisation to unseen layouts under matched constraints. Coverage is:

- *Budgetary* (10): three avoid-hazard objects × hazard counts `hc` ∈ {1, 3, 5, 6}.
- *Sequential* (10): five with `isBefore=true` and five with `isBefore=false`, varying the (`first_obj`, `avoid_obj`) pair.
- *Relational* (10): three avoid-hazard objects × `min_dist` ∈ {1, 2}. The choice to omit `min_dist` ∈ {0, 3} is justified in §[C.4](https://arxiv.org/html/2605.07021#A3.SS4).

### C.4 Benchmark-validity corrections

While porting HazardWorld we identified three pre\-existing issues that affect the validity of comparisons drawn from the original implementation\. We document and correct them here\.

**Non-deterministic scenario generation.** The original `_gen_grid` mixes Gymnasium's seeded `self.np_random` (used for agent and object placement) with Python's unseeded `random` module (used for constraint parameters and hazard placement) and an unseeded `np.random.choice` call in `make_budgetary_constraint`. Consequently, a fixed environment seed does not reproduce the same scenario across runs. We standardize all sampling on the seeded generator so that seed reproducibility is ensured.
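For concreteness, a hedged before/after of the standardization; the actual call sites live in `_gen_grid` and `make_budgetary_constraint` in the released code:

```python
import random

def sample_hazard_unseeded(hazard_types):
    # original behavior: Python's module-level RNG ignores the env seed
    return random.choice(hazard_types)

def sample_hazard_seeded(env, hazard_types):
    # corrected behavior: every draw goes through Gymnasium's seeded
    # env.np_random generator, so a fixed seed reproduces the scenario
    return hazard_types[env.np_random.integers(len(hazard_types))]
```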

**Sequential constraint never deactivates.** `HazardWorldSequential.step()` contains two bugs on the `isBefore` branch in the original repository:

- `curr_cell == first_obj` compares an object instance against an undefined local; the intended check is `curr_cell.type == self.first_obj`.
- `self.avoid_obj == None` is a comparison rather than an assignment (the intended statement is `self.avoid_obj = None`); the active constraint is therefore never deactivated after the trigger fires.

Both are corrected, as sketched below. Because the second bug silently inverts the semantics of `isBefore=true` scenarios, prior numerical results obtained on the original implementation in this regime should be interpreted with caution.
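A simplified sketch of the corrected `isBefore` branch; the surrounding `step()` structure is paraphrased from the description above, not copied from the released code:

```python
def apply_sequential_trigger(env, curr_cell):
    """Corrected isBefore branch of HazardWorldSequential.step();
    `env` stands in for self."""
    # Original (broken): `curr_cell == first_obj` compared an object
    # instance against an undefined local, and `env.avoid_obj == None`
    # was a comparison rather than an assignment.
    if curr_cell is not None and curr_cell.type == env.first_obj:
        env.avoid_obj = None  # deactivate the avoid constraint once triggered
```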

**Relational `min_dist` is not enforced.** The relational constraint nominally requires the agent to remain at least `min_dist` Manhattan cells from the avoid object. However, the original `step()` only registers a violation when the agent steps *onto* the avoid tile (`curr_cell.type == self.avoid_obj`), never when it merely enters the surrounding disk. Two consequences follow:

- `min_dist=0` is trivially always satisfied and therefore semantically empty.
- `min_dist=2` and `min_dist=1` produce identical runtime behavior despite being intended as different difficulty levels.

Additionally, under the partial-observation window of size 5 (radius r = 2) used in our experiments, an agent cannot see a hazard at Manhattan distance ≥ 3, so `min_dist=3` is information-theoretically infeasible. We therefore restrict our relational scenarios to `min_dist` ∈ {1, 2}. We further surface `min_dist` in the `info` dict and emit a warning when `min_dist` exceeds the partial-observation radius.
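For reference, the corrected relational check reduces to a Manhattan-distance test over the whole disk, sketched here with illustrative names:

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def violates_relational(agent_pos, avoid_positions, min_dist):
    """'Remain at least min_dist cells away' registers a violation whenever
    the agent is strictly closer than min_dist to any avoid object; with
    min_dist=0 the condition is unsatisfiable, matching its emptiness."""
    return any(manhattan(agent_pos, p) < min_dist for p in avoid_positions)
```

Under this check, `min_dist=1` flags only the avoid tile itself while `min_dist=2` also flags its four neighbors, restoring the intended difficulty gradient.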

**Other.** We additionally migrated the codebase from `gym` to `gymnasium` (the five-tuple `step`, `reset(seed, options)`, and the `np_random.integers` API) and enriched the `info` dict with `constraint`, `max_violations`, and a `full_map` rendering for downstream analysis. These changes do not affect the experimental setup and are detailed in the released code.

## Appendix D Oversight Module RL

When optimizing the oversight module for efficiency, we treat the early-stopping objective as a multi-turn task in which the monitor is queried at each decision point and must decide whether reasoning should [continue] or [stop].

As the outcome depends on the monitor's outputs across all decision points at the time of termination, we use RL with a terminal reward to train the oversight module. We compare RL-training a monitor with the emergence of [answer] as decision points against a baseline with a decision point after each reasoning step. In both cases, the [answer], surfaced working answer, and [continue] blocks are parsed out due to the fixation effect we saw when attempting to elicit Behavior Cue Reasoning through prompting and RL. For the plain-thought baseline, we use the last formatted answer surfaced by [answer] to label decision points. These experiments aim to identify whether the decision points afforded by the surfacing of [answer] result in better RL training for a weaker monitor. Due to the sheer length of the generated reasoning traces, in all experiments for this setting we report the percentage of total possible token savings compared to an oracle, along with overall reward for validation. Metrics are reported in intervals of gradient update steps.

We use the verl-agent repository [[10](https://arxiv.org/html/2605.07021#bib.bib10)] for RL fine-tuning. We train using PPO [[31](https://arxiv.org/html/2605.07021#bib.bib31)] with a training batch size of 16, a validation batch size of 32, and a learning rate of 1e-6. There are two reward schemes, based on whether the final outcome of the reasoning trajectory is correct. In Cookingworld, all reasoning trajectories are labeled correct due to the confounding factors in multi-turn environments discussed in Appendix [B](https://arxiv.org/html/2605.07021#A2). For correct trajectories, the monitor receives +1 for stopping at the same answer the reasoning would ultimately terminate on, with a scaling early-stop bonus of up to 0.2 when terminating correctly. Continuing past the natural end of reasoning results in a penalty of -1. For incorrect trajectories (reasoning dead-ends), the monitor receives +1 for stopping at any point during the trajectory, with the same scaled early-stop bonus.

For all monitor-training scenarios, we observe that the base model tends to collapse quickly to always outputting 'continue' or 'stop' during early training. To counteract early 'stop' collapse, we also include a small 'continue' reward.
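To make the reward schema concrete, a hedged per-decision sketch; the +1/-1 values and the early-stop bonus of up to 0.2 come from the description above, while the linear bonus scaling, the wrong-stop penalty, and the size of the anti-collapse continue bonus are assumptions:

```python
def monitor_reward(decision, step_idx, num_steps, trajectory_correct,
                   stopped_answer=None, final_answer=None,
                   continue_bonus=0.01, max_early_bonus=0.2):
    """Per-decision view of the terminal reward for the early-stopping
    monitor; in training the reward is assigned at trajectory termination."""
    early = max_early_bonus * (1.0 - step_idx / max(num_steps - 1, 1))
    if decision == "continue":
        if step_idx == num_steps - 1:
            return -1.0            # continued past the natural end of reasoning
        return continue_bonus      # small reward to prevent early 'stop' collapse
    if not trajectory_correct:     # reasoning dead-end: stopping anywhere is good
        return 1.0 + early
    if stopped_answer == final_answer:
        return 1.0 + early         # stopped on the answer reasoning ends with
    return -1.0                    # assumed penalty for stopping on a wrong answer
```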

## Appendix E Oversight Module SFT

The safety objective represents a different paradigm for the oversight module, and we use a different training method as a result. For Hazardworld, at any decision point it does not matter whether previous actions in the reasoning trajectory were labeled as safe or unsafe; this formulation is effectively a standard classification task. The RL framing of the efficiency objective would introduce noise into the safety objective, as the safety of the current action surfaced by [answer] is independent of earlier action classifications. The sole exception is when the earlier action and the current action are the *same* action; however, even this does not carry the sequential correlation that necessitates reward backpropagation in RL. For this reason, we keep the same decision-point formulation used to generate data samples but use SFT instead to train the monitor. In all experiments for this setting, we report the accuracy and F1 score of safe and unsafe labeling. Metrics are reported in intervals of input tokens trained on.

We use the TRL framework for SFT\. Fixed hyperparameters are used across both model families\.
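A minimal TRL sketch of this SFT setup; the dataset path, monitor backbone, and hyperparameter values below are placeholders, not the paper's fixed settings:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder file of decision points labeled safe/unsafe.
dataset = load_dataset("json", data_files="monitor_sft_data.json", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen3-8B",                  # placeholder monitor backbone
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="monitor-sft",
        per_device_train_batch_size=4,       # placeholder hyperparameters
        learning_rate=1e-5,
        num_train_epochs=1,
    ),
)
trainer.train()
```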

## Appendix F Compute

We estimate that all experiments, run end to end, would take roughly a month on a cluster of A100 GPUs with 80GB of memory each.

The primary bottleneck comes from the extended reasoning length allowed for the complex math problems in AIME\. This necessitates a larger memory footprint for SFT and RL training\.

Data generation for initial reasoning traces and subsequent Behavior Cue Reasoning traces can be parallelized across all GPUs and proceeds relatively quickly for an 8B or 9B model. Completions are comparatively far faster, as the model only needs to generate a handful of tokens.

## Appendix G Reporting

For the AIME and Cookingworld baseline experiments, results are reported as an average over 3 seeds for both the base model and the Behavior Cue Reasoning model. For Hazardworld, only one seed is used, as the seeds represent an explicit scenario set verified by the authors.

For the SFT and RL experiments for the monitor, only one seed is provided due to computational constraints in late-stage testing. The shaded region in the RL experiment figures represents the standard deviation from the smoothing performed for graph presentation.

## Appendix H Extended Societal Impacts

This work is motivated by scalable oversight: making reasoning models easier to monitor and intervene on before final outputs are produced\. In beneficial deployments, Behavior Cues could help reduce wasted inference compute, expose decision points for external oversight, and prevent unsafe commitments in constrained environments\.

However, the same interface creates risks when reasoning is mediated through APIs or other systems that hide intermediate tokens from end users\. Because Behavior Cues can be parsed or enforced without being surfaced in the final response, a platform or intermediary could use them to silently steer model reasoning without the user’s awareness\. This may be beneficial when used for disclosed safety interventions, but it also raises transparency and consent concerns if cue\-based control is applied invisibly or for objectives misaligned with the user\.

Finally, Behavior Cues should not be interpreted as deterministic guarantees of model behavior\. LLM generation is inherently stochastic: our training procedure creates strong priors that anchor cue tokens to specific behaviors, but these associations are not guaranteed to hold on every sample or in every domain\. This is reflected in the varying degrees of cue compliance observed across our experiments\. We therefore view Behavior Cues as an interface that can improve monitorability and controllability, not as a replacement for careful validation, adversarial testing, and clear disclosure when reasoning\-control mechanisms are deployed\.
