When to Think, When to Speak: Learning Disclosure Policies for LLM Reasoning
Summary
This paper introduces Side-by-Side Interleaved Reasoning, a method for controlling disclosure timing in autoregressive models to improve accuracy and efficiency. It demonstrates improved performance on benchmarks using Qwen3 models by interleaving private reasoning with partial disclosures.
Cached at: 05/08/26, 08:00 AM
Source: https://huggingface.co/papers/2605.03314
Abstract
Side-by-Side Interleaved Reasoning enables controlled disclosure timing in autoregressive models, improving accuracy and efficiency through interleaved private reasoning and delayed content release.
In single-stream autoregressive interfaces, the same tokens both update the model state and constitute an irreversible public commitment. This coupling creates a silence tax: additional deliberation postpones the first task-relevant content, while naive early streaming risks premature commitments that bias subsequent generations. We introduce Side-by-Side (SxS) Interleaved Reasoning, which makes disclosure timing a controllable decision within standard autoregressive generation. SxS interleaves partial disclosures with continued private reasoning in the same context, but releases content only when it is supported by the reasoning so far. To learn such pacing without incentivizing filler, we construct entailment-aligned interleaved trajectories by matching answer prefixes to supporting reasoning prefixes, then train with SFT to acquire the dual-action semantics and RL to recover reasoning performance under the new format. Across two Qwen3 architectures/scales (MoE Qwen3-30B-A3B, dense Qwen3-4B) and both in-domain (AIME25) and out-of-domain (GPQA-Diamond) benchmarks, SxS improves accuracy versus content-latency Pareto trade-offs under token-level proxies such as inter-update waiting.
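The trajectory construction described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the names Segment, interleave, and inter_update_waits are hypothetical, and the support list (how many reasoning tokens must precede each answer token) is assumed to be given, whereas the paper derives the alignment through entailment matching. The sketch only shows how matching answer prefixes to reasoning prefixes yields an interleaved think/speak sequence, and how a token-level waiting proxy could be counted on it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    kind: str          # "think" (private reasoning) or "speak" (disclosed content)
    tokens: List[str]


def interleave(reasoning: List[str], answer: List[str], support: List[int]) -> List[Segment]:
    """Build an interleaved trajectory (hypothetical helper).

    support[i] is the assumed number of reasoning tokens that must be
    generated before answer token i may be disclosed. Reasoning tokens
    beyond the last required prefix are dropped in this sketch.
    """
    if len(support) != len(answer):
        raise ValueError("support must have one entry per answer token")
    segments: List[Segment] = []
    cursor = 0  # number of reasoning tokens emitted so far
    for token, needed in zip(answer, support):
        # Disclosure order is fixed, so required prefixes never shrink.
        needed = min(max(needed, cursor), len(reasoning))
        if needed > cursor:
            segments.append(Segment("think", reasoning[cursor:needed]))
            cursor = needed
        if segments and segments[-1].kind == "speak":
            segments[-1].tokens.append(token)
        else:
            segments.append(Segment("speak", [token]))
    return segments


def inter_update_waits(segments: List[Segment]) -> List[int]:
    """Count private tokens generated before each disclosure (a simple proxy)."""
    waits, pending = [], 0
    for seg in segments:
        if seg.kind == "think":
            pending += len(seg.tokens)
        else:
            waits.append(pending)
            pending = 0
    return waits


reasoning = ["r1", "r2", "r3", "r4", "r5"]
answer = ["a1", "a2", "a3"]

# Interleaved: a1 and a2 are supported after two reasoning tokens, a3 after five.
sxs = interleave(reasoning, answer, support=[2, 2, 5])
# Think-then-speak baseline: nothing is disclosed until all reasoning is done.
baseline = interleave(reasoning, answer, support=[5, 5, 5])

print([s.kind for s in sxs], inter_update_waits(sxs))
print([s.kind for s in baseline], inter_update_waits(baseline))
```

In this toy case the interleaved trajectory discloses its first content after two private tokens, while the baseline waits for all five. Whether such a gain holds on real models depends on the learned disclosure policy, which the sketch does not model.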
Similar Articles
Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs
This paper introduces Reasoning Exposure Prompting (REP), a method that uses shadow-model demonstrations in code-like formats to elicit hidden reasoning traces from LLMs, showing that interface-level trace hiding is insufficient to prevent extraction of useful reasoning signals.
When to Think Deeply: Inhibitory Deliberation for LLM Reasoning
IDPR is a framework for response-conditioned inhibitory deliberation that first generates a fast intuitive answer, then uses an inhibition controller to decide whether to invoke slow reasoning, achieving efficiency gains while maintaining accuracy.
Multi-Turn Reasoning When Context Arrives in Pieces: Scalable Sharding and Memory-Augmented RL
This paper addresses the 'Lost in Conversation' problem where LLMs struggle with information revealed across multiple turns. It proposes a scalable sharding pipeline to create multi-turn training data from single-turn QA datasets and uses reinforcement learning with verifiable rewards to train a memory-augmented policy that maintains a compact rolling memory, improving multi-turn reasoning accuracy and generalizing zero-shot to harder tasks.
Backtracking When It Strays: Mitigating Dual Exposure Biases in LLM Reasoning Distillation
This paper introduces Motab, a new pipeline for LLM reasoning distillation that mitigates both off-policy and on-policy exposure biases by dynamically monitoring student generation and backtracking to safe states with teacher intervention, achieving ~3% average improvement.
Learning to Refine Hidden States for Reliable LLM Reasoning
Proposes ReLAR, a reinforcement-guided latent refinement framework that iteratively updates hidden representations in LLMs before decoding, improving reasoning reliability and efficiency compared to chain-of-thought methods.