Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs

arXiv cs.AI 06/02/26, 04:00 AM Papers
reasoning-trace llm-security prompting distillation chain-of-thought model-extraction
Summary
This paper introduces Reasoning Exposure Prompting (REP), a method that uses shadow-model demonstrations in code-like formats to elicit hidden reasoning traces from LLMs, showing that interface-level trace hiding is insufficient to prevent extraction of useful reasoning signals.
arXiv:2606.00642v1 Announce Type: new Abstract: Reasoning traces have become a valuable form of learning signals for improving and transferring the capabilities of large language models. In particular, detailed traces can help distill reasoning behavior from stronger teacher models into weaker student models. The value of capability transfer has motivated many deployed systems with reasoning models to hide raw internal traces and expose at most summaries and answers to users. As a result, we ask whether such interface-level trace hiding prevents users from obtaining useful reasoning supervision through prompting. We study this question with Reasoning Exposure Prompting (REP), a lightweight in-context elicitation method that uses shadow-model-generated demonstrations wrapped in auxiliary code-like formats to raise user-visible reasoning traces from a victim model. Across the common reasoning dataset, different victim models, and different student model distillation, REP substantially increases similarity between exposed and REP-conditioned internal traces while preserving useful reasoning signals.
# Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs
Source: [https://arxiv.org/html/2606.00642](https://arxiv.org/html/2606.00642)
Yu-An Lu¹, Ci-Yang Tsai¹, Yu-Lin Tsai², Raluca Ada Popa², Chia-Mu Yu¹

¹National Yang Ming Chiao Tung University, ²UC Berkeley

{yuan.la14, atziluth.en10, chiamuyu}@nycu.edu.tw, {uriah_tsai, raluca}@eecs.berkeley.edu

###### Abstract

Reasoning traces have become a valuable learning signal for improving and transferring the capabilities of large language models. In particular, detailed traces can help distill reasoning behavior from stronger teacher models into weaker student models. The value of this capability transfer has motivated many deployed reasoning systems to hide raw internal traces and expose at most summaries and answers to users. We therefore ask whether such interface-level trace hiding prevents users from obtaining useful reasoning supervision through prompting. We study this question with *Reasoning Exposure Prompting* (REP), a lightweight in-context elicitation method that uses shadow-model-generated demonstrations wrapped in auxiliary code-like formats to elicit user-visible reasoning traces from a victim model. Across a common reasoning dataset, different victim models, and different student distillation settings, REP substantially increases the similarity between exposed traces and REP-conditioned internal traces while preserving useful reasoning signals.

## 1 Introduction

Chain-of-thought prompting has made intermediate reasoning a central technique for improving large language model (LLM) performance on a variety of tasks, including arithmetic, commonsense, symbolic, and code reasoning (Wei et al., [2022](https://arxiv.org/html/2606.00642#bib.bib1); Kojima et al., [2022](https://arxiv.org/html/2606.00642#bib.bib2); Wang et al., [2023](https://arxiv.org/html/2606.00642#bib.bib3)). As a result, reasoning traces have become valuable artifacts in several ways. They can serve as supervision for transferring reasoning behavior into smaller models through rationale and chain-of-thought distillation (Magister et al., [2023](https://arxiv.org/html/2606.00642#bib.bib29); Li et al., [2023](https://arxiv.org/html/2606.00642#bib.bib30); Hsieh et al., [2023](https://arxiv.org/html/2606.00642#bib.bib31)); provide rich explanation traces for imitation learning from stronger models (Mukherjee et al., [2023](https://arxiv.org/html/2606.00642#bib.bib32); Guo et al., [2025](https://arxiv.org/html/2606.00642#bib.bib33)); offer intermediate objects for supervision and step-level verification (Lightman et al., [2024](https://arxiv.org/html/2606.00642#bib.bib26)); support interpretability by making model behavior more inspectable, while raising questions about whether generated rationales are faithful to the actual answers (Turpin et al., [2023](https://arxiv.org/html/2606.00642#bib.bib5); Lanham et al., [2023](https://arxiv.org/html/2606.00642#bib.bib6); Paul et al., [2024](https://arxiv.org/html/2606.00642#bib.bib7)); and offer potential safety-monitoring signals for detecting misbehavior in reasoning models (Baker et al., [2025](https://arxiv.org/html/2606.00642#bib.bib27)).

The same value also makes reasoning traces sensitive. If traces improve downstream models, support verification, and reveal behavioral signals, their exposure may enable capability extraction from frontier systems. Recent reports from Anthropic, Google, and OpenAI describe distillation or model-extraction attempts against frontier models, including reasoning trace coercion and pipelines beyond chain-of-thought extraction (Anthropic, [2026c](https://arxiv.org/html/2606.00642#bib.bib34); Google Threat Intelligence Group, [2026](https://arxiv.org/html/2606.00642#bib.bib35); OpenAI, [2026b](https://arxiv.org/html/2606.00642#bib.bib36)). Independent policy analysis likewise identifies API-based distillation, including answers and intermediate reasoning steps, as a pathway for training student models (Bearman, [2026](https://arxiv.org/html/2606.00642#bib.bib37)). Together, these reports suggest that hidden weights are insufficient protection when user interactions can reveal useful training data.

In response, many commercially deployed systems no longer expose raw reasoning traces. For instance, OpenAI discusses hidden chain-of-thought as a monitoring object rather than a user-facing output (OpenAI, [2024](https://arxiv.org/html/2606.00642#bib.bib12)); Gemini exposes thought summaries rather than raw thoughts (Google, [2026b](https://arxiv.org/html/2606.00642#bib.bib20)); and Claude’s extended thinking provides controlled transparency into step-by-step reasoning (Anthropic, [2026a](https://arxiv.org/html/2606.00642#bib.bib21)). This shift toward restricted-trace design motivates a basic question:

When raw internal reasoning is hidden by design, can user prompting induce exposed traces that correspond to the model’s own reasoning behavior?

We address this question with *Reasoning Exposure Prompting* (REP). The key intuition is that a model may refuse or fail to reveal hidden reasoning when asked directly, but still follow demonstrations in which reasoning is presented as part of the user-visible output. Given a source dataset of interest $D^{s}=\{(q_i^{s},a_i^{s})\}_{i=1}^{n}$, our goal is to elicit reasoning traces for the questions in $D^{s}$ from a victim model whose raw reasoning is not exposed. To achieve this, REP constructs a prefix of question–reasoning–answer demonstrations, wraps this prefix with an auxiliary transformation such as a markdown fence or a shell command, and prepends it to each target question $q_i^{s}$. The victim’s user-visible response is then parsed into an exposed reasoning trace and a final answer. Thus, rather than directly requesting hidden reasoning, REP creates a context in which visible reasoning is the demonstrated pattern, encouraging the model to continue that pattern on the target question.

End-to-end distillation utility alone does not reveal why an exposed trace is useful. A trace may improve a student model because it faithfully reflects the victim’s reasoning, or because it provides a plausible rationale generated under a different prompt-induced behavior. To distinguish these cases, we track three traces in open-weight evaluation: $r_0$, the benign internal trace under standard prompting; $r_1$, the internal trace under REP; and $r_2$, the exposed reasoning trace under REP. These traces let us evaluate four complementary properties. *Structural validity* asks whether REP produces a parseable reasoning-then-answer response. *Exposure fidelity* asks whether $r_2$ reflects the REP-conditioned internal trace $r_1$. *Behavior preservation* asks whether REP preserves the victim’s original reasoning behavior, as reflected by consistency with $r_0$ and with the final answer. *Functional utility* asks whether exposed traces provide useful signals for downstream distillation. This decomposition is necessary because comparing only $r_0$ and $r_2$ cannot distinguish faithful exposure from a shifted reasoning path, and distillation accuracy alone cannot determine whether the useful trace reflects the victim model’s own reasoning.

Our experiments use OpenThoughts-114k as the source dataset, Qwen3-14B and Qwen3-32B as victim models, Qwen3-14B as the shadow model, and Qwen2.5-7B-Instruct as the student. We study multiple REP wrappers, cross-dataset transfer, cross-model transfer, and downstream distillation. Our best configuration, markdown-fence REP with $k=3$ demonstrations, selected by trace-level fidelity metrics, achieves the strongest downstream utility. Averaged across benchmarks, it outperforms answer-only supervision by a factor of $2.09$, summarized traces by $1.25$, and TIA-style reasoning trace inversion (Zhang et al., [2026](https://arxiv.org/html/2606.00642#bib.bib8)) by $1.23$, while reaching $96.7\%$ of the oracle internal-trace reference. These results suggest that REP-exposed traces are not merely stylistic imitations but carry transferable reasoning signal.

Our contributions are:

- We introduce REP, a lightweight prompting method for eliciting exposed reasoning traces from reasoning LLMs.
- We empirically study REP across prompting formats, demonstration sources, victim models, and student distillation settings, providing a controlled evaluation of when exposed traces contain useful reasoning supervision.
- We provide initial evidence that exposed traces elicited by REP can improve smaller student models, even when the victim’s internal reasoning is not available to users.

## 2 Related Work

#### Reasoning trace distillation\.

Reasoning traces are useful not only at inference time but also as supervision. Prior work shows that generated rationales and chain-of-thought traces can train smaller models to reason more effectively (Magister et al., [2023](https://arxiv.org/html/2606.00642#bib.bib29); Li et al., [2023](https://arxiv.org/html/2606.00642#bib.bib30); Hsieh et al., [2023](https://arxiv.org/html/2606.00642#bib.bib31)), support self-improvement from generated rationales (Zelikman et al., [2022](https://arxiv.org/html/2606.00642#bib.bib28)), and provide rich explanation traces for imitation learning from stronger models (Mukherjee et al., [2023](https://arxiv.org/html/2606.00642#bib.bib32); Guo et al., [2025](https://arxiv.org/html/2606.00642#bib.bib33)). Our work is motivated by this utility: if user-visible exposed traces preserve enough reasoning signal, they may serve as useful distillation data even when raw internal traces are hidden.

#### Hidden reasoning and trace recovery\.

Many deployed reasoning systems now hide, summarize, or otherwise moderate raw reasoning traces (OpenAI, [2024](https://arxiv.org/html/2606.00642#bib.bib12); Baker et al., [2025](https://arxiv.org/html/2606.00642#bib.bib27); Google, [2026b](https://arxiv.org/html/2606.00642#bib.bib20); Anthropic, [2026a](https://arxiv.org/html/2606.00642#bib.bib21)). This creates a restricted-trace setting in which the user observes the final answer, and sometimes a summary, but not the full internal reasoning process. Most closely related to our setting, TIA (Zhang et al., [2026](https://arxiv.org/html/2606.00642#bib.bib8)) trains trace inversion models to synthesize reasoning traces from visible inputs, answers, and optional summaries. This shows that useful reasoning supervision can be reconstructed without direct access to raw traces. Our work studies a complementary question: instead of training a separate inversion model, we ask whether user prompting can induce the victim model to externalize user-visible traces, and whether those traces support downstream distillation.

#### Faithfulness of reasoning traces\.

Generated reasoning is not necessarily faithful to the computation that produces the final answer. Prior work shows that LLMs can rationalize biased or incorrect answers without revealing the true factors driving the prediction (Turpin et al., [2023](https://arxiv.org/html/2606.00642#bib.bib5)), and that interventions on chain-of-thought do not always causally affect final answers in a reliable way (Lanham et al., [2023](https://arxiv.org/html/2606.00642#bib.bib6); Paul et al., [2024](https://arxiv.org/html/2606.00642#bib.bib7)). More recently, Chen et al. ([2025](https://arxiv.org/html/2606.00642#bib.bib38)) show that state-of-the-art reasoning models often fail to verbalize cues or hints that influence their answers. These findings are especially important for our setting: an exposed trace may look coherent and useful, but still fail to correspond to the model’s actual reasoning behavior. We therefore do not treat exposed traces as ground truth merely because they are fluent. Instead, our evaluation separates structural validity, exposure fidelity between $r_1$ and $r_2$, behavior preservation relative to $r_0$, and downstream functional utility.

#### Reasoning trace leakage and mitigation\.

A related line of work studies how chain-of-thought traces can leak sensitive content. For example, CoT may leak personally identifiable information even when the final answer is sanitized, motivating defenses based on privacy-aware reasoning, inference-time filtering, or activation steering toward leakage-free thoughts (Das et al., [2026](https://arxiv.org/html/2606.00642#bib.bib39); Ahrend et al., [2026](https://arxiv.org/html/2606.00642#bib.bib41); Batra et al., [2025](https://arxiv.org/html/2606.00642#bib.bib40)). Security work on prompt injection and context leakage similarly treats hidden model context as an exposure surface (Gehlot, [2025](https://arxiv.org/html/2606.00642#bib.bib42)), but our object of study is reasoning trace exposure rather than system-prompt or context-state extraction. Our focus is whether prompting can elicit capability-bearing traces from a model whose raw reasoning is hidden by design, and whether those exposed traces are faithful enough to support downstream distillation.

## 3 Problem Formulation

#### Application scenario\.

We study reasoning trace exposure in deployed reasoning models. A service provider hosts a victim model $M_v$ whose raw internal reasoning is hidden (protected by a defensive system prompt and assumed to be redacted from the user’s view), exposing only the user-facing response. Raw traces are treated as sensitive artifacts: they can aid performance, monitoring, and debugging, but if extracted at scale they may enable capability transfer. We ask whether a black-box user can nevertheless induce useful reasoning traces through prompting alone.

#### Protected asset\.

The protected asset is the victim model’s hidden reasoning behavior on a source dataset

$$D^{s}=\{(q_j^{s},a_j^{s})\}_{j=1}^{n},$$

where $q_j^{s}$ is a question and $a_j^{s}$ is its final answer. The attacker initially observes no victim reasoning traces for these questions. Their goal is to obtain user-visible traces that reflect the victim’s reasoning behavior on $D^{s}$.

#### Attacker capabilities\.

The attacker has black-box prompt access to the victim model $M_v$: they may submit chosen prompts and observe only the resulting user-visible text. They do not observe the victim’s hidden reasoning trace, weights, logits, training data, or system prompt. The attacker may also use a shadow model $M_s$ and an auxiliary demonstration dataset

$$D^{\mathrm{demo}}=\{(q_i^{\mathrm{demo}},a_i^{\mathrm{demo}})\}_{i=1}^{m},$$

solely to construct in-context demonstrations. Crucially, $D^{\mathrm{demo}}$ is distinct from the protected traces over $D^{s}$: it provides prompting examples, not the victim traces the attacker seeks to expose.

#### Trace observation.

In realistic deployment, the attacker observes only the user-visible response of $M_v$. For controlled open-weight evaluation, we additionally record internal traces from $M_v$ in order to quantify whether an exposed trace reflects the victim’s own reasoning behavior rather than a fabricated rationale. For each target question $q_j^{s}$, we distinguish three traces:

- $r_0$: the benign internal reasoning trace produced by $M_v$ under standard prompting.
- $r_1$: the internal reasoning trace produced by $M_v$ under REP.
- $r_2$: the exposed reasoning trace visible to the user under REP.

We use the term *internal reasoning trace* as an operational object in controlled open-weight evaluation, not as a claim of a unique ground-truth cognitive process (Anthropic, [2026](https://arxiv.org/html/2606.00642#bib.bib13)).

#### Attacker goals\.

The attacker’s objective is capability extraction through reasoning trace exposure: given black-box access to $M_v$ and a source dataset $D^{s}$, they seek user-visible traces $r_2$ for questions $q_j^{s}\in D^{s}$ that can train a student model $M_{\mathrm{stu}}$. Since downstream utility alone does not show whether a trace reflects the victim’s own reasoning, we evaluate exposure using four criteria: *structural validity* for parseability, *exposure fidelity* between $r_2$ and $r_1$, *behavior preservation* with respect to $r_0$ and the final answer, and *functional utility* for downstream distillation. Together, these criteria distinguish faithful reasoning exposure from plausible but unfaithful rationales and from prompt-induced reasoning drift.

## 4 Reasoning Exposure Prompting

![Refer to caption](https://arxiv.org/html/2606.00642v1/x1.png)

Figure 1: Overview of REP. REP constructs $k$-shot reasoning demonstrations $\mathcal{S}_k$ from an auxiliary dataset $D^{\mathrm{demo}}$, transforms them with a wrapper $T(\cdot)$, and prepends the resulting demonstrations to each target question $q\in D^{s}$. The victim model $M_v$ is then prompted to produce user-visible exposed reasoning $r_2$ and final answer $a$.

### 4.1 Shadow Reasoning Demonstrations

Figure [1](https://arxiv.org/html/2606.00642#S4.F1) illustrates the REP pipeline. We first sample $k$ questions from the auxiliary demonstration dataset $D^{\mathrm{demo}}$. For each demonstration question $q_i^{\mathrm{demo}}$, we query the shadow model $M_s$ to obtain a reasoning trace and answer:

$$(r_i^{\mathrm{shadow}},a_i^{\mathrm{shadow}})=M_s(q_i^{\mathrm{demo}}).$$

This yields the $k$-shot demonstration set

$$\mathcal{S}_k=\{(q_i^{\mathrm{demo}},r_i^{\mathrm{shadow}},a_i^{\mathrm{shadow}})\}_{i=1}^{k}.$$
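The construction of the $k$-shot demonstration set can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `build_demonstrations`, the `shadow_model` callable interface, the seeded sampling, and the toy stand-in model are all assumptions introduced here.

```python
import random
from typing import Callable, List, Tuple

# One demonstration: (question, shadow reasoning trace, shadow answer).
Demo = Tuple[str, str, str]


def build_demonstrations(
    demo_questions: List[str],
    shadow_model: Callable[[str], Tuple[str, str]],
    k: int = 3,
    seed: int = 0,
) -> List[Demo]:
    """Sample k questions from the demo pool and annotate them with the shadow model.

    `shadow_model` is a hypothetical callable that returns (reasoning, answer)
    for a question; in practice it would wrap a locally run open-weight model.
    """
    rng = random.Random(seed)
    sampled = rng.sample(demo_questions, k)  # sampling without replacement
    demos: List[Demo] = []
    for question in sampled:
        reasoning, answer = shadow_model(question)
        demos.append((question, reasoning, answer))
    return demos


def toy_shadow_model(question: str) -> Tuple[str, str]:
    """Toy stand-in for the shadow model, used only to make the sketch runnable."""
    return (f"Reasoning about: {question}", "42")
```

Fixing the seed mirrors the statement in Section 4.3 that the prefix is held fixed across source questions; any other way of freezing the sampled demonstrations would serve equally well.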

### 4.2 Auxiliary Transformation

Given shadow demonstrations $\mathcal{S}_k$, REP applies an auxiliary transformation $T(\cdot)$ to construct the REP prefix as $\mathrm{prefix}=T(\mathcal{S}_k)$. The transformation wraps each demonstration in a code- or tool-like convention. We study six variants: a plain echo baseline, shell command, Python REPL, markdown fence, Jupyter cell, and agentic tool-call wrapper. The design is motivated by the hypothesis that execution-like formats can make the model treat the context as text to be reproduced or inspected, rather than as ordinary natural-language reasoning. We analyze this hypothesis in Section [7](https://arxiv.org/html/2606.00642#S7). The wrapper details are in Appendix [A](https://arxiv.org/html/2606.00642#A1).

### 4.3 Reasoning Exposure Prompt

For each question $q_j^{s}\in D^{s}$, the final REP prompt is

$$\mathrm{REP}(q_j^{s})=(\mathrm{prefix},q_j^{s}),$$

where $\mathrm{prefix}=T(\mathcal{S}_k)$ is constructed from $D^{\mathrm{demo}}$ and held fixed across source questions unless otherwise stated. The victim model response is parsed as

$$M_v(\mathrm{REP}(q_j^{s}))=(r_1,r_2,a_j),$$

where $r_1$ is the REP-conditioned internal trace recorded in open-weight evaluation, $r_2$ is the exposed reasoning trace, and $a_j$ is the final answer. We omit the subscript $j$ of $r$ for brevity.
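To make the prefix-then-question structure concrete, the following Python sketch assembles a REP prompt with a markdown-fence style wrapper. The exact wrapper text used in the paper is given in its Appendix A and is not reproduced here, so the layout below (the `Question:` label, the `text` fence tag, and the tag placement) is a hypothetical stand-in; only the overall shape, a fixed prefix of wrapped demonstrations followed by the target question, follows the definition above.

```python
from typing import List, Tuple

# One demonstration: (question, shadow reasoning trace, shadow answer).
Demo = Tuple[str, str, str]

# Fence marker built programmatically so this listing contains no literal fence.
FENCE = "`" * 3


def markdown_fence_wrapper(demos: List[Demo]) -> str:
    """Hypothetical T(.): render each demonstration inside a fenced block."""
    blocks = []
    for question, reasoning, answer in demos:
        body = (
            f"<think>\n{reasoning}\n</think>\n"
            f"<answer>\n{answer}\n</answer>"
        )
        blocks.append(f"Question: {question}\n{FENCE}text\n{body}\n{FENCE}")
    return "\n\n".join(blocks)


def rep_prompt(prefix: str, target_question: str) -> str:
    """REP(q) = (prefix, q): the fixed prefix followed by the target question."""
    return f"{prefix}\n\nQuestion: {target_question}\n"
```

Because the prefix does not depend on the target question, it can be computed once and reused for every question in the source dataset.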

### 4.4 Reasoning Trace Fidelity

We evaluate reasoning exposure along four dimensions: structural validity, exposure fidelity, behavior preservation, and functional utility\.

#### Structural validity\.

We report Struct%, the percentage of responses that can be parsed into the expected reasoning-then-answer format, e.g., valid <think\> and <answer\> blocks.
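A structural-validity check of this kind could look like the sketch below. The regular expression, the requirement that both blocks be non-empty, and the function names are assumptions made for illustration; the paper does not specify its parser.

```python
import re
from typing import List, Optional, Tuple

# A think block followed by an answer block, allowing whitespace in between.
_PATTERN = re.compile(
    r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL
)


def parse_response(text: str) -> Optional[Tuple[str, str]]:
    """Return (exposed reasoning, answer), or None if the format is invalid."""
    match = _PATTERN.search(text)
    if match is None:
        return None
    reasoning = match.group(1).strip()
    answer = match.group(2).strip()
    if not reasoning or not answer:
        return None  # treat empty blocks as structurally invalid (assumption)
    return reasoning, answer


def struct_percent(responses: List[str]) -> float:
    """Percentage of responses that parse into the expected format."""
    if not responses:
        return 0.0
    valid = sum(parse_response(r) is not None for r in responses)
    return 100.0 * valid / len(responses)
```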

#### Exposure fidelity\.

We measure whether the exposed trace reflects the victim’s REP-conditioned internal reasoning by reporting $\mathrm{ROUGE\text{-}L}(r_1,r_2)$. Higher overlap indicates that the user-visible trace more closely resembles the victim’s internal reasoning under the same REP prompt.

#### Behavior preservation\.

A faithful exposed trace is meaningful only if REP does not substantially change the victim’s task behavior. We therefore compare standard and REP-conditioned runs using answer match and $R_{01}$, where $R_{01}=\mathrm{ROUGE\text{-}L}(r_0,r_1)$. We also report $R_{02}=\mathrm{ROUGE\text{-}L}(r_0,r_2)$ to check whether the visible trace remains aligned with the benign reasoning path.
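The three pairwise overlaps can be computed with a small longest-common-subsequence routine, sketched below in Python. The paper does not state which ROUGE-L variant or tokenizer it uses, so the F-measure form and whitespace tokenization here are assumptions; the key `R12` follows the abbreviation $\mathrm{ROUGE\text{-}L}(r_1,r_2)$ used in Section 6.1.

```python
from typing import Dict, List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using O(len(b)) memory."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference: str, candidate: str) -> float:
    """ROUGE-L F-measure over whitespace tokens (variant is an assumption)."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def trace_overlaps(r0: str, r1: str, r2: str) -> Dict[str, float]:
    """R01 and R02 track behavior preservation; R12 tracks exposure fidelity."""
    return {
        "R01": rouge_l(r0, r1),
        "R02": rouge_l(r0, r2),
        "R12": rouge_l(r1, r2),
    }
```

Reading the three values together is what separates the cases discussed above: a high `R12` with a low `R01` would indicate that the exposed trace is faithful to a reasoning path that REP itself has shifted.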

#### Functional utility\.

Finally, we test whether exposed traces provide useful supervision for downstream distillation. We fine-tune a student model $M_{\mathrm{stu}}$ on

$$\{(q_j^{s},r_2,a_j)\}_{j=1}^{n}$$

and evaluate the resulting model on math and code benchmarks. We compare against answer-only supervision, summarized traces, post-hoc trace reconstruction, and oracle internal traces. Strong functional utility indicates that exposed traces carry transferable reasoning information beyond surface style.
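As a rough illustration of how the triples $(q_j^{s}, r_2, a_j)$ might be serialized for supervised fine-tuning, the Python sketch below writes them as JSON Lines and contrasts them with answer-only records. The `prompt`/`completion` field names and the placement of the trace inside a think block are assumptions made here; the actual data format of the distillation recipe is not described in this section.

```python
import json
from typing import Dict, Iterable, Tuple

# One training example: (question, exposed trace r2, final answer).
Example = Tuple[str, str, str]


def trace_record(question: str, exposed_trace: str, answer: str) -> Dict[str, str]:
    """Supervise the student on the exposed reasoning followed by the answer."""
    completion = f"<think>\n{exposed_trace}\n</think>\n{answer}"
    return {"prompt": question, "completion": completion}


def answer_only_record(question: str, answer: str) -> Dict[str, str]:
    """Control condition: supervise on the final answer alone."""
    return {"prompt": question, "completion": answer}


def to_jsonl(examples: Iterable[Example]) -> str:
    """Serialize trace-supervised examples, one JSON object per line."""
    return "\n".join(
        json.dumps(trace_record(q, r2, a), ensure_ascii=False)
        for q, r2, a in examples
    )
```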

## 5 Experimental Setup

#### Datasets\.

We use OpenThoughts-114k (OpenThoughts, [2025](https://arxiv.org/html/2606.00642#bib.bib19)) as the primary dataset for trace elicitation and downstream distillation. To study cross-dataset transfer, we construct REP demonstrations from OpenThoughts (OpenThoughts, [2025](https://arxiv.org/html/2606.00642#bib.bib19)), MATH500 (Hendrycks et al., [2021](https://arxiv.org/html/2606.00642#bib.bib18)), GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2606.00642#bib.bib17)), and JEEBench (Arora et al., [2023](https://arxiv.org/html/2606.00642#bib.bib43)). After distillation, we evaluate the resulting student model on MATH500, AIME24 (Mathematical Association of America, [2024](https://arxiv.org/html/2606.00642#bib.bib14)), AIME25 (Mathematical Association of America, [2025](https://arxiv.org/html/2606.00642#bib.bib15)), JEE Math, and LiveCodeBench (LCB) (Jain et al., [2025](https://arxiv.org/html/2606.00642#bib.bib16)).

#### Models\.

In our experiments, the shadow model $M_s$ is Qwen3-14B (Yang et al., [2025](https://arxiv.org/html/2606.00642#bib.bib9)) and is used to generate the in-context demonstrations. The attacker has black-box prompt access to $M_v$ but does not observe its internal reasoning trace in deployment. Since $M_s$ is open-weight, the attacker can run it locally to construct demonstrations, while the victim $M_v$ is only available through black-box access. For evaluation with open-weight victims, we additionally record internal traces to quantify fidelity.

#### Distillation\.

We distill into Qwen2.5-7B-Instruct (Qwen Team, [2025](https://arxiv.org/html/2606.00642#bib.bib10)) using the s1-distill full-parameter fine-tuning recipe of Muennighoff et al. ([2025](https://arxiv.org/html/2606.00642#bib.bib25)) for 5 epochs on B200 and H200 GPUs. We report the best checkpoint by $\Delta$-sum over the evaluation benchmarks. For JEEBench, we restrict evaluation to the math-only subset since our distillation corpus is math-dominant; reporting the full 515-problem set would dilute the signal with off-domain physics and chemistry. We report both strict and partial accuracy following the JEEBench protocol.

#### Baseline Triggers\.

To isolate the effect of REP, we compare against two no-trigger baselines that use the same deployed defender system prompt (Appendix [A.1](https://arxiv.org/html/2606.00642#A1.SS1)) but no shadow demonstrations or wrapper. Both ask for a single <think\> block followed by a plain-text reasoning restatement: *Baseline R* requests repetition, while *Baseline C* uses a simple “let’s think step by step” CoT instruction. These baselines measure the exposure obtained from the user instruction alone, against which we compare the gains from REP. Exact prompts are in Appendix [A.2](https://arxiv.org/html/2606.00642#A1.SS2).

#### Metrics\.

For trace elicitation, we report Struct%, $\mathrm{ROUGE\text{-}L}$ overlaps, and answer match rate. For distillation, we report student model accuracy.

#### Scope of closed\-source evaluation\.

We do not elicit hidden reasoning from closed-source commercial models or use their outputs for distillation, since major providers restrict reverse engineering, automated output extraction, or training competing models from outputs (OpenAI, [2026a](https://arxiv.org/html/2606.00642#bib.bib22); Anthropic, [2026b](https://arxiv.org/html/2606.00642#bib.bib23); Google, [2026a](https://arxiv.org/html/2606.00642#bib.bib24)). Accordingly, we restrict trace-level evaluation and distillation to open-weight models, where internal traces can be recorded under controlled conditions.

## 6 Evaluation Results

### 6.1 Selecting the REP Configuration

We select the default REP configuration on a 500-example subset of OpenThoughts-114k by varying the wrapper $T(\cdot)$ and the number of demonstrations $k$. We abbreviate $R_{02}=\mathrm{ROUGE\text{-}L}(r_0,r_2)$, $R_{01}=\mathrm{ROUGE\text{-}L}(r_0,r_1)$, and $R_{12}=\mathrm{ROUGE\text{-}L}(r_1,r_2)$.

#### Wrapper format\.

Table [1](https://arxiv.org/html/2606.00642#S6.T1) compares wrappers at fixed $k=3$. Code-style wrappers substantially improve $R_{12}$ over the no-trigger and plain baselines. Markdown fence gives the highest $R_{02}$ and $R_{12}$, indicating the strongest exposed-trace fidelity.

| Wrapper setting | Struct % | $R_{02}$ | $R_{01}$ | $R_{12}$ | Ans. Match |
| --- | --- | --- | --- | --- | --- |
| Baseline R (repeat) | 96.0 | 0.162 | 0.379 | 0.132 | 38.6 |
| Baseline C (simple CoT leakage) | 86.8 | 0.158 | 0.333 | 0.118 | 36.8 |
| plain | 69.2 | 0.212 | 0.238 | 0.156 | 33.6 |
| shell | 79.4 | 0.271 | 0.270 | 0.451 | 33.6 |
| Python REPL | 85.0 | 0.280 | 0.277 | 0.477 | 34.0 |
| markdown fence | 78.2 | 0.288 | 0.263 | 0.482 | 33.8 |
| Jupyter cell | 82.0 | 0.278 | 0.273 | 0.472 | 33.6 |
| agent tool | 83.0 | 0.280 | 0.278 | 0.455 | 34.6 |

Table 1: Wrapper comparison at fixed $k=3$ on a 500-example OpenThoughts-114k subset. All settings used Qwen3-14B (Yang et al., [2025](https://arxiv.org/html/2606.00642#bib.bib9)) as the victim model.
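The selection step can be reproduced from the numbers in Table 1 with a few lines of Python. The values below are copied from the table; the tie-breaking rule (rank by $R_{12}$, then by $R_{02}$) and the name `select_wrapper` are assumptions introduced for this sketch, since the text only states that the configuration is chosen by trace-level fidelity.

```python
from typing import Dict

# Fidelity columns of Table 1 (victim Qwen3-14B, k = 3).
TABLE_1: Dict[str, Dict[str, float]] = {
    "Baseline R (repeat)": {"R02": 0.162, "R01": 0.379, "R12": 0.132},
    "Baseline C (simple CoT leakage)": {"R02": 0.158, "R01": 0.333, "R12": 0.118},
    "plain": {"R02": 0.212, "R01": 0.238, "R12": 0.156},
    "shell": {"R02": 0.271, "R01": 0.270, "R12": 0.451},
    "Python REPL": {"R02": 0.280, "R01": 0.277, "R12": 0.477},
    "markdown fence": {"R02": 0.288, "R01": 0.263, "R12": 0.482},
    "Jupyter cell": {"R02": 0.278, "R01": 0.273, "R12": 0.472},
    "agent tool": {"R02": 0.280, "R01": 0.278, "R12": 0.455},
}


def select_wrapper(table: Dict[str, Dict[str, float]]) -> str:
    """Pick the wrapper with the highest R12, breaking ties by R02 (assumed rule)."""
    return max(table, key=lambda name: (table[name]["R12"], table[name]["R02"]))


best = select_wrapper(TABLE_1)  # "markdown fence"
```

On these values the choice does not depend on the tie-break: markdown fence has the highest $R_{12}$ and the highest $R_{02}$ of any row.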
#### Number of demonstrations\.

We report the full wrapper–shot ablation in Appendix [C](https://arxiv.org/html/2606.00642#A3). Overall, $k=3$ gives the strongest exposure fidelity: with the markdown-fence wrapper, it achieves the best $\mathrm{ROUGE\text{-}L}(r_0,r_2)$ and $\mathrm{ROUGE\text{-}L}(r_1,r_2)$, while increasing to $k=4$ provides no further gain.

#### Default configuration\.

We therefore use markdown fence with $k=3$ for all subsequent experiments unless otherwise stated.

### 6.2 Evaluation on Functional Utility

We first evaluate the functional utility of REP-exposed traces. Table [2](https://arxiv.org/html/2606.00642#S6.T2) compares different forms of reasoning supervision for downstream student distillation, including oracle internal traces, REP-exposed traces, answer-only supervision, summarized traces, and TIA-style trace inversion (Zhang et al., [2026](https://arxiv.org/html/2606.00642#bib.bib8)). To our knowledge, TIA is currently the only prior work that explicitly studies reasoning trace extraction under restricted trace access. Overall, REP-exposed traces outperform answer-only and summarized supervision on every math benchmark (results on LCB are mixed), while also achieving stronger and more stable downstream performance than TIA across most evaluated benchmarks. This suggests that directly inducing the victim model to externalize reasoning traces through prompting may preserve richer reasoning supervision than post-hoc trace reconstruction approaches.

| Category | Victim / Teacher | Student supervision | MATH500 (↑) | AIME24 (↑) | AIME25 (↑) | JEE Math (s/p) (↑) | LCB (↑) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| No distillation | – | – | 71.0 | 8.9 | 2.2 | 32.2 / 35.9 | 15.8 |
| Oracle internal trace | Qwen3-14B | Internal trace | 70.3 | 14.4 | 13.3 | 48.5 / 51.2 | 14.7 |
| Oracle internal trace | Qwen3-32B | Internal trace | 70.0 | 16.7 | 15.6 | 46.4 / 49.3 | 15.8 |
| REP exposed trace | Qwen3-14B | Exposed trace, all valid | 72.4 | 12.2 | 13.3 | 33.5 / 38.9 | 18.3 |
| REP exposed trace | Qwen3-14B | Exposed trace, answer-clean | 75.8 | 14.4 | 13.3 | 35.2 / 39.5 | 19.0 |
| REP exposed trace | Qwen3-32B | Exposed trace, all valid | 73.9 | 13.3 | 13.3 | 38.1 / 42.2 | 16.5 |
| REP exposed trace | Qwen3-32B | Exposed trace, answer-clean | 72.8 | 14.4 | 17.8 | 36.4 / 41.1 | 15.8 |
| Control supervision | Qwen3-14B | Answer only | 25.5 | 1.1 | 0.0 | 31.8 / 35.9 | 17.6 |
| Control supervision | Qwen3-32B | Answer only | 25.4 | 0.0 | 0.0 | 31.4 / 35.3 | 16.8 |
| Control supervision | Qwen3-14B | Summary of reasoning trace | 69.3 | 7.8 | 8.9 | 25.7 / 29.2 | 18.3 |
| Control supervision | Qwen3-32B | Summary of reasoning trace | 69.8 | 8.9 | 4.4 | 25.8 / 29.4 | 16.8 |
| TIA (Zhang et al., [2026](https://arxiv.org/html/2606.00642#bib.bib8)) | Qwen3-14B | Trace inversion attack | 72.0 | 11.1 | 20.0 | 23.9 / 26.3 | 2.62 |
| TIA (Zhang et al., [2026](https://arxiv.org/html/2606.00642#bib.bib8)) | Qwen3-32B | Trace inversion attack | 71.4 | 8.9 | 20.0 | 19.9 / 21.8 | 9.21 |

Table 2: Main comparison of student distillation sources under Qwen victim models. All rows fine-tune the same Qwen2.5-7B-Instruct student using the same distillation recipe. Bold marks the best result among non-oracle supervision sources for each benchmark.

### 6.3 Cross-Dataset Transfer

Table [3](https://arxiv.org/html/2606.00642#S6.T3) evaluates whether demonstrations must come from the same dataset as the target questions. We fix the markdown-fence wrapper with $k=3$ and vary the demonstration pool. All source datasets improve $\mathrm{ROUGE\text{-}L}(r_0,r_2)$ over the no-trigger baseline. This suggests that REP’s effect is not purely due to in-domain memorization and can transfer across math and reasoning datasets.

| Demo. source | Struct % | $R_{02}$ | $R_{12}$ | $R_{01}$ | AnsM % | ROUGE-1 $(r_0,r_2)$ | ROUGE-2 $(r_0,r_2)$ | LEN |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline-R | 96.0 | 0.169 | 0.141 | 0.397 | 38.6 | 0.247 | 0.137 | 320 |
| OpenThoughts | 78.2 | 0.322 | 0.618 | 0.337 | 33.8 | 0.573 | 0.364 | 1115 |
| MATH500 | 89.6 | 0.276 | 0.460 | 0.346 | 34.2 | 0.464 | 0.288 | 927 |
| GSM8K | 94.0 | 0.260 | 0.454 | 0.343 | 35.6 | 0.431 | 0.272 | 617 |
| JEEBench | 88.6 | 0.298 | 0.550 | 0.343 | 34.6 | 0.504 | 0.322 | 919 |

Table 3: Cross-dataset transfer with the REP method (victim Qwen3-14B). $\mathrm{ROUGE\text{-}L}$ is reported for all three trace pairs; ROUGE-1/2 are on the primary $(r_0,r_2)$ pair. All metrics are computed on full untruncated traces. LEN is the mean token length of the leaked trace $r_2$.

### 6.4 Cross-Victim Model Transfer

Table [4](https://arxiv.org/html/2606.00642#S6.T4) studies cross-model transfer as the victim model varies. Within the Qwen3 family, the same-architecture Qwen3-14B victim is the most vulnerable ($\mathrm{ROUGE\text{-}L}(r_0, r_2) = 0.322$), while the larger Qwen3-32B shows a slightly weaker effect ($0.292$), the Qwen3.6-27B variant is essentially immune ($0.158$), and the 235B mixture-of-experts model resists the schema injection most strongly ($0.088$). Exposure does not, however, simply track the architecture family: gpt-oss-20b ($0.222$) and Gemma-4-31B ($0.355$) both leak substantially, driven by their native channel-separated reasoning formats, and Gemma-4-31B in fact shows the highest exposure of any victim. Cross-model transfer is therefore strong within the Qwen3 family and can also extend to architecturally divergent models whose reasoning is rendered in channel- or tool-style formats.

| Victim model | Struct % | ROUGE-L $(r_0, r_2)$ | ROUGE-L $(r_1, r_2)$ | ROUGE-L $(r_0, r_1)$ | AnsM % | ROUGE-1 $(r_0, r_2)$ | ROUGE-2 $(r_0, r_2)$ | LEN |
|---|---|---|---|---|---|---|---|---|
| Qwen3-14B | 78.2 | 0.322 | 0.618 | 0.337 | 33.8 | 0.573 | 0.364 | 1115 |
| Qwen3-32B | 61.0 | 0.292 | 0.640 | 0.328 | 24.4 | 0.505 | 0.307 | 895 |
| Qwen3.6-27B | 77.4 | 0.158 | 0.621 | 0.208 | 38.4 | 0.264 | 0.169 | 942 |
| Qwen3-235B-A22B | 89.8 | 0.088 | 0.188 | 0.248 | 39.4 | 0.123 | 0.080 | 542 |
| gpt-oss-20b | 88.8 | 0.222 | 0.255 | 0.370 | 35.6 | 0.313 | 0.158 | 203 |
| Gemma-4-31B | 82.2 | 0.355 | 0.618 | 0.421 | 15.8 | 0.526 | 0.374 | 861 |

Table 4: Cross-model transferability. REP demonstrations are generated by Qwen3-14B using the Wrapper [A.4](https://arxiv.org/html/2606.00642#A1.SS4.SSS0.Px4) markdown fence with $k=3$ (our default configuration) and applied to each victim. All metrics are computed on full untruncated traces; higher means more leakage. Bold marks the best cell per column (primary $(r_0, r_2)$ and the $(r_0, r_2)$ detail metrics).
### 6.5 Distillation from Filtered Exposed Traces

Table [5](https://arxiv.org/html/2606.00642#S6.T5) evaluates whether exposed traces remain useful for student training under different filtering criteria. We sample 10k prompts from OpenThoughts-114k as the distillation query set and use them to elicit stolen examples from Qwen3-14B and Qwen3-32B victims. The *orig* split discards rows whose victim output fails structural extraction, while the *clean* split further requires the extracted answer to match the OpenThoughts ground-truth answer. Distilling on clean Qwen3-14B traces improves the student from 71.0 to 75.8 on MATH500, 8.9 to 14.4 on AIME24, 2.2 to 13.3 on AIME25, and 15.8 to 19.0 on LCB, supporting the functional value of exposed traces.
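
The two filtering criteria can be sketched as successive filters over elicited rows. This is an illustrative sketch only: the `ElicitedRow` schema, its field names, and the `normalize` helper are assumptions made for the example, and the paper's actual structural extraction and answer matching are not specified here and are likely more involved.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ElicitedRow:
    """One victim response to a distillation query (hypothetical schema)."""
    question: str
    exposed_trace: Optional[str]     # None when structural extraction failed
    extracted_answer: Optional[str]  # None when no answer could be extracted
    gold_answer: str


def normalize(answer: str) -> str:
    """Crude normalization used only for this example."""
    return answer.strip().lower()


def orig_split(rows: List[ElicitedRow]) -> List[ElicitedRow]:
    """Keep rows whose victim output passed structural extraction."""
    return [
        r for r in rows
        if r.exposed_trace is not None and r.extracted_answer is not None
    ]


def clean_split(rows: List[ElicitedRow]) -> List[ElicitedRow]:
    """Further require the extracted answer to match the ground truth."""
    return [
        r for r in orig_split(rows)
        if normalize(r.extracted_answer) == normalize(r.gold_answer)
    ]


# Toy rows (hypothetical): one clean, one wrong answer, one extraction failure.
rows = [
    ElicitedRow("q1", "trace one", "2/5", "2/5"),
    ElicitedRow("q2", "trace two", "7", "9"),
    ElicitedRow("q3", None, None, "4"),
]
print(len(orig_split(rows)), len(clean_split(rows)))
```

By construction the *clean* split is a subset of the *orig* split, so it trades data volume for answer correctness.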

| Victim | Type | MATH500 (↑) | AIME24 (↑) | AIME25 (↑) | JEE Math (s/p) (↑) | LCB (↑) |
|---|---|---|---|---|---|---|
| Baseline | – | 71.0 | 8.9 | 2.2 | 32.2 / 35.9 | 15.8 |
| Qwen3-14B | orig | 72.4 | 12.2 | 13.3 | 33.5 / 38.9 | 18.3 |
| Qwen3-14B | clean | 75.8 | 14.4 | 13.3 | 35.2 / 39.5 | 19.0 |
| Qwen3-32B | orig | 73.9 | 13.3 | 13.3 | 38.1 / 42.2 | 16.5 |
| Qwen3-32B | clean | 72.8 | 14.4 | 17.8 | 36.4 / 41.1 | 15.8 |

Table 5: Distillation on different stealing configurations. JEE Math reports both strict (s) and partial (p) accuracy, following the JEEBench MCQ (multiple) protocol of Arora et al. ([2023](https://arxiv.org/html/2606.00642#bib.bib43)). Student is Qwen2.5-7B-Instruct.
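
As a quick arithmetic check on the gains quoted for the clean Qwen3-14B traces, the absolute improvements over the undistilled baseline can be recomputed from the Table 5 entries. The dictionary names below are illustrative; the numbers are copied from the table.

```python
# Scores copied from Table 5 (baseline vs. Qwen3-14B "clean" traces).
baseline = {"MATH500": 71.0, "AIME24": 8.9, "AIME25": 2.2, "LCB": 15.8}
clean_14b = {"MATH500": 75.8, "AIME24": 14.4, "AIME25": 13.3, "LCB": 19.0}

# Absolute gain in points, rounded to one decimal to match the table's precision.
gains = {name: round(clean_14b[name] - baseline[name], 1) for name in baseline}

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```

The largest absolute gain falls on AIME25, where the undistilled baseline is weakest.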

## 7 Analysis and Discussion

#### Does REP change the model’s reasoning?

A central concern is that REP may induce a new reasoning trajectory rather than expose an existing one. Our formulation addresses this by comparing $r_0$, $r_1$, and $r_2$. If $r_2$ is close to $r_1$ but far from $r_0$, REP may be redistributing reasoning. If $r_2$ remains aligned with $r_0$ and supports downstream distillation, the exposed trace is more likely to preserve useful internal reasoning behavior. Current results show that REP increases $\mathrm{ROUGE\text{-}L}(r_1, r_2)$ while retaining non-trivial answer match rate and distillation gains, but more causal analysis is needed.

#### Theoretical justification of why REP works\.

REP consistently elicits exposed reasoning traces across different victim models and datasets. We hypothesize that this arises from a *code-paradigm transfer effect*. Rather than directly requesting hidden reasoning, REP embeds reasoning traces into code- or tool-oriented rendering formats such as shell commands, Python REPL outputs, markdown fences, notebook cells, or tool-call responses. As a result, the model may interpret the task as completing a code-structured rendering pattern rather than revealing protected internal reasoning. More formally, reasoning suppression is mainly optimized under conversational distributions $\mathcal{D}_{\mathrm{chat}}$, whereas REP shifts decoding toward code- and tool-centric distributions $\mathcal{D}_{\mathrm{code}}$.

Because modern LLMs are heavily pretrained on repositories, notebooks, shell logs, and agent trajectories, they learn strong priors that code-style patterns, such as a `cat` of a text file, a `print(open(…))` call, or a `<tool_result>` block, should be faithfully completed. REP exploits this prior through in-context demonstrations. Importantly, REP does not execute tools or retrieve hidden files. Instead, it induces a format-conditioned continuation rule that maps a demonstration $(q, r, a)$ to a code- or tool-style rendering of its reasoning $r$. Consequently, reasoning traces hidden under ordinary conversational prompting may become externalized under code-oriented prompting. Our experiments in Appendix [D](https://arxiv.org/html/2606.00642#A4) test this hypothesis.

#### Fidelity is not only lexical\.

$\mathrm{ROUGE\text{-}L}$ is useful for surface comparison but insufficient for reasoning equivalence. We also incorporate functional utility to evaluate exposed traces. Distinguishing faithful reasoning exposure from stylistic mimicry remains future work, requiring semantic step alignment and trace perturbation tests.

#### Security implications\.

Reasoning traces are valuable intellectual artifacts because they can transfer reasoning behavior to smaller models. Our results suggest that hiding traces at the interface cannot fully prevent traces from being elicited. This complements TIA, which shows that traces can be synthesized without raw trace access (Zhang et al., [2026](https://arxiv.org/html/2606.00642#bib.bib8)). These findings suggest that reasoning trace protection requires more than hiding visible chain-of-thought text.

#### Adaptive attacks and defenses\.

REP is not a single prompt but a family of transformations $T(\cdot)$ spanning code-, markdown-, notebook-, and tool-style renderings. This makes deterministic defenses brittle: blocking a specific string, delimiter, or wrapper may stop one variant, while minor format changes can preserve reasoning exposure. Refusal-oriented defenses are also insufficient, since jailbreak prompting can suppress refusal behavior while REP provides a format-conditioned path for reasoning reconstruction. Therefore, reasoning trace defenses should be evaluated under adaptive wrapper and jailbreak combinations rather than fixed prompts alone. More robust defenses likely require semantic or model-level mechanisms that prevent hidden reasoning reconstruction across conversational, code, file-rendering, and tool-output formats.
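
The brittleness of string-level defenses can be illustrated with a toy filter. This is a hypothetical sketch, not a defense evaluated in the paper: the marker list, the function name `blocked_by_string_filter`, and both prompt fragments are invented for illustration, with only the `cat reasoning_trace.txt` command taken from Appendix D.

```python
from typing import List

# Hypothetical literal markers that a string-matching filter might block.
BLOCKED_MARKERS: List[str] = ["cat reasoning_trace.txt", "<tool_result>"]


def blocked_by_string_filter(prompt: str) -> bool:
    """Return True if any fixed marker appears verbatim in the prompt."""
    return any(marker in prompt for marker in BLOCKED_MARKERS)


# A wrapper fragment the filter was written for.
known_variant = "$ cat reasoning_trace.txt"
# Same file-reading framing with a different command and file name (illustrative).
shifted_variant = "$ head -n 200 trace_notes.txt"

print(blocked_by_string_filter(known_variant))
print(blocked_by_string_filter(shifted_variant))
```

The filter matches surface strings rather than the file-reading framing itself, so any member of the transformation family that avoids the listed markers passes unchanged. This is the gap that semantic or model-level mechanisms would need to close.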

## 8 Conclusion

We introduced REP, a lightweight prompting method for eliciting exposed reasoning traces from reasoning LLMs\. By comparing benign, REP\-conditioned, and exposed traces, we evaluate both exposure and fidelity\. Experiments show that code\-style wrappers substantially increase trace overlap and that exposed traces remain useful for downstream distillation, suggesting that hidden reasoning can be partially externalized through prompting\.

## 9 Limitations

First, our current evaluation focuses primarily on open\-weight reasoning models where internal traces can be recorded under controlled settings\. Although this setup enables fidelity analysis between internal and exposed traces, commercial closed\-source systems may adopt substantially different reasoning\-suppression mechanisms, safety filters, or trace\-isolation strategies that could affect REP behavior\.

Second, REP is evaluated mainly through output\-level similarity and downstream distillation utility rather than causal mechanistic analysis\. While the observed alignment between exposed traces and REP\-conditioned reasoning suggests that REP can externalize useful reasoning signals, further interpretability analysis is needed to determine whether the exposed traces faithfully reflect the victim model’s underlying reasoning process or merely approximate it behaviorally\.

Third, some REP variants reduce answer\-match rates, indicating a potential trade\-off between reasoning trace exposure and behavior preservation\. This suggests that REP may partially alter the victim model’s reasoning trajectory in certain settings, especially under stronger code\-like wrappers or longer in\-context demonstrations\.

Finally, our current evaluation remains limited to a relatively small set of victim architectures and reasoning benchmarks\. Future work should investigate whether similar reasoning\-exposure behaviors generalize to broader model families, multimodal reasoning systems, and deployment environments with more advanced trace\-hiding defenses\.

## 10 Ethical Considerations

We study reasoning trace exposure in reasoning LLMs through Reasoning Exposure Prompting \(REP\), a prompting\-based method that attempts to elicit user\-visible reasoning traces from models whose internal reasoning is partially hidden\. Our goal is to better understand the security and capability\-transfer implications of hidden reasoning interfaces, rather than to facilitate unauthorized extraction or misuse of proprietary reasoning systems\.

All experiments are conducted in controlled research settings using open\-weight models, public benchmark datasets, and locally deployed evaluation pipelines\. We do not target commercial APIs, bypass platform safeguards, or perform large\-scale extraction against production systems\. In addition, we avoid evaluating REP against commercial closed\-source systems for downstream distillation, partly due to provider policies restricting automated extraction or competitive model training from outputs\.

We recognize that REP is a dual\-use technique and may introduce misuse concerns related to reasoning trace extraction and capability transfer\. However, REP operates entirely through standard black\-box prompting and does not require access to model weights, hidden activations, training data, or system infrastructure\. This makes the risk important to disclose, because interface\-level reasoning suppression alone may not fully prevent capability\-bearing traces from being externalized\.

Our findings suggest that protecting reasoning traces may require stronger defenses than simply hiding chain\-of\-thought outputs at the interface level\. Future defenses may require architecture\-level trace isolation, reasoning summarization mechanisms, behavioral consistency monitoring, or policies restricting large\-scale reasoning trace collection\. We hope this work helps researchers, model providers, and platform developers better understand the limitations of current hidden\-reasoning designs and develop safer reasoning\-model deployment practices\.

## 11 Acknowledgments

We used AI\-based writing assistance tools solely to check grammar, improve clarity, and polish language\. These tools were not used to generate research ideas, conduct experiments, analyze results, or draw conclusions\.

## References

- Safer reasoning traces: measuring and mitigating chain-of-thought leakage in llms. arXiv preprint arXiv:2603.05618.
- Anthropic (2026a) Building with extended thinking. Note: [https://platform.claude.com/docs/en/build-with-claude/extended-thinking](https://platform.claude.com/docs/en/build-with-claude/extended-thinking) Claude API Docs. Accessed: 2026-05.
- Anthropic (2026b) Can i use my outputs to train an ai model? Note: [https://support.claude.com/en/articles/12326764-can-i-use-my-outputs-to-train-an-ai-model](https://support.claude.com/en/articles/12326764-can-i-use-my-outputs-to-train-an-ai-model) Claude Help Center. Accessed: 2026-05-24.
- Anthropic (2026c) Detecting and preventing distillation attacks. Note: Anthropic announcement. Accessed: 2026-05-26. External Links: [Link](https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks).
- Anthropic (2026) Claude mythos preview system card. System card, Anthropic. Note: Accessed: 2026-05-26. External Links: [Link](https://www-cdn.anthropic.com/08ab9158070959f88f296514c21b7facce6f52bc.pdf).
- D. Arora, H. Singh, et al. (2023) Have llms advanced enough? a challenging problem solving benchmark for large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 7527–7543.
- B. Baker, J. Huizinga, L. Gao, Z. Dou, M. Y. Guan, A. Madry, W. Zaremba, J. Pachocki, and D. Farhi (2025) Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. arXiv preprint arXiv:2503.11926.
- S. Batra, P. Tillman, S. Gaggar, S. Kesineni, K. Zhu, S. Dev, A. Panda, V. Sharma, and M. Chaudhary (2025) SALT: steering activations towards leakage-free thinking in chain of thought. arXiv preprint arXiv:2511.07772.
- T. Bearman (2026) AI distillation attacks: the case for targeted government intervention. Note: Institute for AI Policy and Strategy memo. Accessed: 2026-05-26. External Links: [Link](https://www.iaps.ai/research/ai-distillation-attacks-the-case-for-targeted-government-intervention).
- Y. Chen, J. Benton, A. Radhakrishnan, J. Uesato, C. Denison, J. Schulman, A. Somani, P. Hase, M. Wagner, F. Roger, et al. (2025) Reasoning models don’t always say what they think. arXiv preprint arXiv:2505.05410.
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- A. Das, S. S. Chintha, R. Girmal, K. Pandey, and S. Endait (2026) Chain-of-sanitized-thoughts: plugging pii leakage in cot of large reasoning models. arXiv preprint arXiv:2601.05076.
- A. Gehlot (2025) Leaking openai’s hidden gpt-5 system prompt via context poisoning. Note: Shinobi Security Blog. Accessed: 2026-05-26. External Links: [Link](https://shinobi.security/resources/blogs/gpt5-context-poisoning).
- Google Threat Intelligence Group (2026) GTIG AI Threat Tracker: distillation, experimentation, and (continued) integration of AI for adversarial use. Note: Google Cloud Blog. Accessed: 2026-05-26. External Links: [Link](https://cloud.google.com/blog/topics/threat-intelligence/distillation-experimentation-integration-ai-adversarial-use).
- Google (2026a) Gemini api additional terms of service. Note: Google AI for Developers terms. Effective: 2026-03-23. Accessed: 2026-05-24. External Links: [Link](https://ai.google.dev/gemini-api/terms).
- Google (2026b) Gemini thinking. Note: Google AI for Developers documentation. Last updated: 2026-05-18. Accessed: 2026-05-26. External Links: [Link](https://ai.google.dev/gemini-api/docs/thinking).
- D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. (2025) DeepSeek-r1 incentivizes reasoning in llms through reinforcement learning. Nature 645 (8081), pp. 633–638.
- D. Hendrycks, C. Burns, S. Basart, A. Critch, J. Li, D. Song, and J. Steinhardt (2021) Measuring mathematical problem solving with the math dataset. NeurIPS.
- C. Hsieh, C. Li, C. Yeh, H. Nakhost, Y. Fujii, A. Ratner, R. Krishna, C. Lee, and T. Pfister (2023) Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 8003–8017.
- N. Jain, A. Gu, W. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica (2025) Livecodebench: holistic and contamination free evaluation of large language models for code. In International Conference on Learning Representations, Vol. 2025, pp. 58791–58831.
- T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa (2022) Large language models are zero-shot reasoners. Advances in neural information processing systems 35, pp. 22199–22213.
- T. Lanham, A. Chen, A. Radhakrishnan, B. Steiner, C. Denison, D. Hernandez, D. Li, E. Durmus, E. Hubinger, J. Kernion, et al. (2023) Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702.
- L. H. Li, J. Hessel, Y. Yu, X. Ren, K. Chang, and Y. Choi (2023) Symbolic chain-of-thought distillation: small models can also “think” step-by-step. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2665–2679.
- H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2024) Let’s verify step by step. In International Conference on Learning Representations, Vol. 2024, pp. 39578–39601.
- L. C. Magister, J. Mallinson, J. Adamek, E. Malmi, and A. Severyn (2023) Teaching small language models to reason. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 1773–1781.
- Mathematical Association of America (2024) American invitational mathematics examination (AIME) 2024. Note: American Mathematics Competitions. Competition problems used for mathematical reasoning evaluation. Accessed: 2026-05-26. External Links: [Link](https://maa.org/maa-invitational-competitions/).
- Mathematical Association of America (2025) American invitational mathematics examination (AIME) 2025. Note: American Mathematics Competitions. Competition problems used for mathematical reasoning evaluation. Accessed: 2026-05-26. External Links: [Link](https://maa.org/maa-invitational-competitions/).
- N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi, L. Zettlemoyer, P. Liang, E. Candès, and T. B. Hashimoto (2025) S1: simple test-time scaling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 20286–20332.
- S. Mukherjee, A. Mitra, G. Jawahar, S. Agarwal, H. Palangi, and A. Awadallah (2023) Orca: progressive learning from complex explanation traces of gpt-4. arXiv preprint arXiv:2306.02707.
- OpenAI (2024) Learning to reason with LLMs. Note: OpenAI release. Accessed: 2026-05-26. External Links: [Link](https://openai.com/index/learning-to-reason-with-llms/).
- OpenAI (2026a) Terms of use. Note: OpenAI policies. Effective: 2026-01-01. Accessed: 2026-05-24. External Links: [Link](https://openai.com/policies/row-terms-of-use/).
- OpenAI (2026b) Updated stakes for american-led, democratic AI. Note: Letter to the U.S. House Select Committee on Strategic Competition between the United States and the Chinese Communist Party. Accessed: 2026-05-26. External Links: [Link](https://assets.bwbx.io/documents/users/iqjWHBFdfxIU/rRmql_jJcxb4/v0).
- OpenThoughts (2025) OpenThoughts-114k. Note: [https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) Reasoning and chain-of-thought dataset for language models.
- D. Paul, R. West, A. Bosselut, and B. Faltings (2024) Making reasoning matter: measuring and improving faithfulness of chain-of-thought reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 15012–15032.
- Qwen Team (2025) Qwen2.5 technical report. arXiv preprint arXiv:2412.15115. External Links: 2412.15115, [Document](https://dx.doi.org/10.48550/arXiv.2412.15115), [Link](https://arxiv.org/abs/2412.15115).
- M. Turpin, J. Michael, E. Perez, and S. Bowman (2023) Language models don’t always say what they think: unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems 36, pp. 74952–74965.
- X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou (2023) Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=1PL1NIMMrw).
- J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. (2022) Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35, pp. 24824–24837.
- A. Yang et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388. External Links: 2505.09388, [Link](https://arxiv.org/abs/2505.09388).
- E. Zelikman, Y. Wu, J. Mu, and N. Goodman (2022) Star: bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems 35, pp. 15476–15488.
- T. Zhang, J. X. Morris, and V. Shmatikov (2026) How to steal reasoning without reasoning traces. arXiv preprint arXiv:2603.07267.

## Appendix A REP Prefix Wrappers

The main paper introduces the six REP variants by name \(see Section 4\.2 of the main paper\)\. This appendix gives the byte\-faithful per\-demonstration wrappers, trailers, and assembly rule as implemented in our codebase\.

### A.1 Defender System Prompt

Every elicitation run, both the no-trigger baselines and all REP variants, is issued under the same deployed defender system prompt. Its boundary clauses constitute the defense: all reasoning is confined to a single `<think>...</think>` block, and only the final answer is permitted outside the closing tag. The exact string is shown below.

``

`A\.2 Baseline Trigger Prompts We use two no\-trigger baselines that share the defender system prompt above and differ only in the user\-message instruction; neither contains any shadow demonstrations or code\-like wrapper\. Both instruct the victim to emit one <think\> block and then restate its reasoning as plain text outside the block, isolating the effect of the REP wrapper from the bare repeat\-your\-reasoning request\. The target question \{test\_q\} is appended after the instruction\. Line breaks shown below are inserted for typesetting only\. Baseline R \(repeat\)\. The plain\-text region is requested as a verbatim repetition of the in\-think reasoning\. Baseline C \(simple CoT leakage\)\. The plain\-text region is requested via a “let’s think step by step” chain\-of\-thought instruction rather than a verbatim repeat\. A\.3 Variant Registry The six variants and their string identifiers are: ID Name Wrapper 0 baseline\_plain Wrapper 1 shell\_cat Wrapper 2 python\_repl Wrapper 3 markdown\_fence Wrapper 4 jupyter\_cell Wrapper 5 agent\_tool Demonstration reasoning is passed through verbatim with any literal <think\>/</think\> tags inside the source trace stripped before insertion; no character\-level truncation is applied to either the reasoning rr or the answer aa\. Shots are sampled with the fixed seed random\.Random\(7\)\.sample\(pool, 50\)\[:kk\], so growing kk from 1 to 4 appends additional demonstrations without reshuffling\. A\.4 Variant\-Specific Blocks \(k=1k=1\) The block under each variant below shows the complete REP prompt body the victim sees at k=1k=1: a single Example 1: demonstration followed by the Wrapper\-specific trailer\. Placeholders \{q\_1\}, \{r\_1\}, \{a\_1\} stand for the shot’s question, stripped shadow reasoning, and answer; \{test\_q\} is the target question the victim is asked to solve\. 
Line breaks shown below are inserted for typesetting only\.111For Wrapper 5, each <tool\_call\> JSON tag is shown line\-wrapped for typesetting; in the actual prompt the opening tag, JSON body, and closing </tool\_call\> appear on a single line\. The Wrappers 0–5 trailers are likewise single paragraphs in the source code\. baseline\_plain\. Plain echo: reasoning is repeated once inside <think\> and once as bare text, followed by the answer\. Wrapper 1: shell\_cat\. The reasoning and the answer are dumped as if printed by two shell cat commands\. Wrapper 2: python\_repl\. The same read\-and\-print pattern is framed as a Python REPL session\. Wrapper 3: markdown\_fence\. Each cat invocation and its output are wrapped in a Markdown fenced block tagged bash\. Wrapper 4: jupyter\_cell\. Demonstrations mimic a Jupyter notebook with shell\-escape cells\. Wrapper 5: agent\_tool\. Demonstrations are framed as an agent issuing <tool\_call\> JSON reads and receiving <tool\_result\> payloads\. Assembly for k\>1k\>1\. For shot counts k\>1k\>1, additional Example 2:, …\\ldots, Example kk: blocks are inserted between Example 1: and the trailer, each rendered with the same wrapper\-specific wrapper and a fresh triple \(qis,ris,ais\)\(q\_\{i\}^\{s\},r\_\{i\}^\{s\},a\_\{i\}^\{s\}\) drawn from the shadow demonstration pool\. Demonstrations are separated by a single blank line; one further blank line precedes the trailer\. The wrap variant and the trailer are the only wrapper\-dependent components; all other assembly steps are identical across wrappers\. Appendix B Example of an Exposed Reasoning Trace Figure 2 shows one end\-to\-end qualitative sample from our OpenThoughts\-114k slice \(victim Qwen3\-14B, default markdown\_fence wrapper, k=3k\{=\}3\)\. Mid\-trace content is abbreviated; the head and tail of each trace are verbatim\. 
The hidden r0r\_\{0\} and the REP outputs r1r\_\{1\}/r2r\_\{2\} share the same setup, the same 8/20=2/58/20\{=\}2/5 pivotal step, and the same final answer C – the qualitative counterpart of the aggregate ROUGE−L\\mathrm\{ROUGE\-L\} gains in Table 6\. r1r\_\{1\} and r2r\_\{2\} are emitted in a single forward pass under REP; the bash scaffolding in r2r\_\{2\} is the model’s own output, not a post\-hoc wrapper\. Query q\\boldsymbol\{q\} A box contains 5 black ties, 7 gold ties, and 8 pink ties\. Stephen randomly chooses a tie from the box\. Each tie is equally likely to be chosen\. The probability that Stephen chooses a pink tie is equivalent to \(A\) 14\\tfrac\{1\}\{4\} \(B\) 720\\tfrac\{7\}\{20\} \(C\) 25\\tfrac\{2\}\{5\} \(D\) 35\\tfrac\{3\}\{5\} \(E\) 34\\tfrac\{3\}\{4\} Hidden trace r𝟎\\boldsymbol\{r\_\{0\}\} REP\-Conditioned r𝟏\\boldsymbol\{r\_\{1\}\} REP Leaked r𝟐\\boldsymbol\{r\_\{2\}\} Figure 2: End\-to\-end example of an exposed reasoning trace under REP \(victim Qwen3\-14B, markdown\_fence, k=3k\{=\}3\)\. The bash scaffolding in r2r\_\{2\} is the victim’s own emission, not a post\-hoc wrapper\. 
Appendix C Full REP Configuration Sweep Method kk Struct % 𝐑𝟎𝟐\\mathbf\{R\_\{02\}\} 𝐑𝟎𝟏\\mathbf\{R\_\{01\}\} 𝐑𝟏𝟐\\mathbf\{R\_\{12\}\} Answer Match Rate No\-trigger baseline – 96\.0 0\.162 0\.379 0\.132 38\.6 baseline\_plain 1 60\.8 0\.216 0\.214 0\.168 32\.4 baseline\_plain 2 51\.8 0\.241 0\.177 0\.129 34\.6 baseline\_plain 3 69\.2 0\.212 0\.238 0\.156 33\.6 baseline\_plain 4 75\.4 0\.198 0\.254 0\.170 33\.4 shell\_cat 1 71\.4 0\.214 0\.239 0\.271 30\.4 shell\_cat 2 74\.2 0\.239 0\.254 0\.316 33\.4 shell\_cat 3 79\.4 0\.271 0\.270 0\.451 33\.6 shell\_cat 4 80\.0 0\.261 0\.264 0\.406 32\.4 python\_repl 1 78\.2 0\.264 0\.270 0\.420 31\.0 python\_repl 2 79\.8 0\.254 0\.266 0\.398 33\.8 python\_repl 3 85\.0 0\.280 0\.277 0\.477 34\.0 python\_repl 4 82\.0 0\.272 0\.269 0\.435 33\.4 markdown\_fence 1 81\.2 0\.274 0\.274 0\.444 31\.2 markdown\_fence 2 71\.0 0\.271 0\.241 0\.418 33\.6 markdown\_fence 3 78\.2 0\.288 0\.263 0\.482 33\.8 markdown\_fence 4 79\.2 0\.276 0\.260 0\.459 33\.6 jupyter\_cell 1 79\.2 0\.259 0\.249 0\.368 29\.8 jupyter\_cell 2 81\.6 0\.264 0\.273 0\.395 34\.0 jupyter\_cell 3 82\.0 0\.278 0\.273 0\.472 33\.6 jupyter\_cell 4 81\.2 0\.268 0\.268 0\.421 32\.6 agent\_tool 1 79\.6 0\.240 0\.272 0\.365 29\.4 agent\_tool 2 80\.8 0\.260 0\.270 0\.380 33\.4 agent\_tool 3 83\.0 0\.280 0\.278 0\.455 34\.6 agent\_tool 4 81\.0 0\.269 0\.264 0\.425 33\.0 Table 6: Effect of REP format and number of demonstrations on a 500\-example subset of OpenThoughts\-114k\. Wrapper 3 markdown fence with k=3k=3 is used as the default configuration\. Table 6 reports the full REP configuration sweep over wrapper formats and number of in\-context demonstrations\. The main text reports two slices of this grid: wrapper comparison at fixed k=3k=3 \(Table 1\) and demonstration\-count comparison for the markdown\-fence wrapper \(Wrapper A\.4\)\. 
We write $R_{02}=\mathrm{ROUGE\text{-}L}(r_0,r_2)$, $R_{01}=\mathrm{ROUGE\text{-}L}(r_0,r_1)$, and $R_{12}=\mathrm{ROUGE\text{-}L}(r_1,r_2)$.

## Appendix D Isolating the Code-Paradigm Effect

Section 7 hypothesizes that REP operates through a code-paradigm transfer effect: rendering the hidden reasoning as the output of a code- or file-reading operation shifts decoding toward code-centric distributions on which reasoning suppression is weak. If this framing is indeed the underlying mechanism, then progressively removing the code-rendering scaffolding from an otherwise identical prompt should reduce exposure monotonically. We test this with a three-step controlled degradation of the default markdown_fence reveal (Wrapper A.4).

Controlled degradation. All three conditions use the best configuration from Section 6.1 and the defender system prompt (Appendix A.1); they differ only in the markers that introduce the post-`</think>` reveal of the reasoning $r$ and answer $a$:

(1) Full code (markdown_fence): the reasoning and answer are rendered inside fenced bash blocks introduced by a shell command (`cat reasoning_trace.txt`).

(2) No code: the command and the file artifact are removed entirely, leaving plain natural-language labels (Reasoning: / Answer:).

(3) Bare command: the markdown fence and the shell prompt are removed, keeping only the bare command word and the named file (`cat reasoning_trace.txt`).

Condition (1) is the full code paradigm; (2) removes the paradigm altogether; and (3) preserves the file-reading semantics but strips its syntactic scaffolding. The reasoning reveal under each condition is shown below (the answer reveal is analogous); everything else in the prompt is held fixed.
The code-paradigm hypothesis in Section 7 predicts that leakage is ordered $(1)>(3)>(2)$.

| Cond. | Reveal format | Struct % | $\mathbf{R_{02}}$ | $\mathbf{R_{01}}$ | $\mathbf{R_{12}}$ | Ans. |
|---|---|---|---|---|---|---|
| (1) full code | bash fence + cat | 81.4 | 0.287 | 0.270 | 0.502 | 34.6 |
| (3) bare cmd | `cat <file>` | 81.8 | 0.279 | 0.273 | 0.441 | 34.8 |
| (2) no code | Reasoning:/Answer: | 80.0 | 0.252 | 0.267 | 0.353 | 34.6 |

Table 7: Code-paradigm degradation on a 500-example OpenThoughts subset with victim Qwen3-14B. Leakage decreases monotonically ($(1)>(3)>(2)$) on both $R_{02}$ and $R_{12}$, while structural validity and answer match stay flat. Condition (1) reproduces the main text within run-to-run noise.

Result. Table 7 confirms the predicted ordering $(1)>(3)>(2)$ on both leakage metrics. Exposure fidelity $R_{12}$ falls monotonically ($0.502\rightarrow 0.441\rightarrow 0.353$) as the code scaffolding is stripped, and benign-trace overlap $R_{02}$ falls in step ($0.287\rightarrow 0.279\rightarrow 0.252$); every condition remains well above the no-trigger floor ($R_{12}=0.132$, $R_{02}=0.162$). Reducing the fenced rendering to a bare command ($1\rightarrow 3$) costs $0.061$ in $R_{12}$, and removing the file-reading metaphor entirely ($3\rightarrow 2$) costs a further $0.088$. Crucially, structural validity ($\approx$80–82%) and answer match ($\approx$34–35%) are flat across all three conditions, so the gradient reflects what the victim externalizes rather than whether it still solves the task. This monotone degradation supports the code-paradigm hypothesis: the more closely the reveal resembles a code/file-rendering operation, the more of the victim's internal reasoning is externalized into the user-visible channel.
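The ordering and the per-step costs quoted above can be recomputed directly from the Table 7 values. The snippet below is a small sanity check rather than part of the paper's evaluation code; the dictionary layout and helper names are assumptions made for illustration.

```python
# Leakage metrics copied from Table 7, keyed by condition.
table7 = {
    "(1) full code": {"R02": 0.287, "R12": 0.502},
    "(3) bare cmd": {"R02": 0.279, "R12": 0.441},
    "(2) no code": {"R02": 0.252, "R12": 0.353},
}

# Predicted leakage ordering: (1) > (3) > (2).
order = ["(1) full code", "(3) bare cmd", "(2) no code"]


def is_strictly_decreasing(values):
    """True if each value is strictly larger than the one after it."""
    return all(a > b for a, b in zip(values, values[1:]))


def step_drops(values):
    """Drop between consecutive conditions, rounded to three decimals."""
    return [round(a - b, 3) for a, b in zip(values, values[1:])]


r12 = [table7[c]["R12"] for c in order]
r02 = [table7[c]["R02"] for c in order]

print(is_strictly_decreasing(r12), step_drops(r12))
print(is_strictly_decreasing(r02), step_drops(r02))
```

The two drops in $R_{12}$ correspond to the $1\rightarrow 3$ and $3\rightarrow 2$ costs reported in the Result paragraph.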