When to Think, When to Speak: Learning Disclosure Policies for LLM Reasoning

Hugging Face Daily Papers

Summary

This paper introduces Side-by-Side Interleaved Reasoning, a method for controlling disclosure timing in autoregressive models to improve accuracy and efficiency. It demonstrates improved performance on benchmarks using Qwen3 models by interleaving private reasoning with partial disclosures.

In single-stream autoregressive interfaces, the same tokens both update the model state and constitute an irreversible public commitment. This coupling creates a silence tax: additional deliberation postpones the first task-relevant content, while naive early streaming risks premature commitments that bias subsequent generations. We introduce Side-by-Side (SxS) Interleaved Reasoning, which makes disclosure timing a controllable decision within standard autoregressive generation. SxS interleaves partial disclosures with continued private reasoning in the same context, but releases content only when it is supported by the reasoning so far. To learn such pacing without incentivizing filler, we construct entailment-aligned interleaved trajectories by matching answer prefixes to supporting reasoning prefixes, then train with SFT to acquire the dual-action semantics and RL to recover reasoning performance under the new format. Across two Qwen3 architectures/scales (MoE Qwen3-30B-A3B, dense Qwen3-4B) and both in-domain (AIME25) and out-of-domain (GPQA-Diamond) benchmarks, SxS improves the Pareto trade-off between accuracy and content latency under token-level proxies such as inter-update waiting.

Cached at: 05/08/26, 08:00 AM

Source: https://huggingface.co/papers/2605.03314

Abstract

Side-by-Side Interleaved Reasoning enables controlled disclosure timing in autoregressive models, improving accuracy and efficiency through interleaved private reasoning and delayed content release.

In single-stream autoregressive interfaces, the same tokens both update the model state and constitute an irreversible public commitment. This coupling creates a silence tax: additional deliberation postpones the first task-relevant content, while naive early streaming risks premature commitments that bias subsequent generations. We introduce Side-by-Side (SxS) Interleaved Reasoning, which makes disclosure timing a controllable decision within standard autoregressive generation. SxS interleaves partial disclosures with continued private reasoning in the same context, but releases content only when it is supported by the reasoning so far. To learn such pacing without incentivizing filler, we construct entailment-aligned interleaved trajectories by matching answer prefixes to supporting reasoning prefixes, then train with SFT to acquire the dual-action semantics and RL to recover reasoning performance under the new format. Across two Qwen3 architectures/scales (MoE Qwen3-30B-A3B, dense Qwen3-4B) and both in-domain (AIME25) and out-of-domain (GPQA-Diamond) benchmarks, SxS improves the Pareto trade-off between accuracy and content latency under token-level proxies such as inter-update waiting.
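
To make the pacing idea concrete, here is a minimal sketch of how an interleaved stream might be split into private reasoning and public disclosures, and how token-level waiting proxies could be computed from it. Everything in it is an assumption made for illustration: the `<think>` and `<say>` delimiters, the `pacing_stats` function, the `PacingStats` fields, and whitespace splitting as a stand-in for tokenization are all hypothetical, and the abstract does not define "inter-update waiting" precisely. The sketch reads it as the number of private tokens emitted between consecutive disclosures.

```python
import re
from dataclasses import dataclass

# Hypothetical markup: the abstract does not say how SxS delimits segments.
SEGMENT = re.compile(r"<(think|say)>(.*?)</\1>", re.DOTALL)


@dataclass
class PacingStats:
    disclosed_text: str          # what a reader of the public stream would see
    first_content_wait: int      # private tokens before the first disclosure
    max_inter_update_wait: int   # longest private run preceding any disclosure


def pacing_stats(stream: str) -> PacingStats:
    """Split an interleaved stream and measure waiting in (proxy) tokens."""
    disclosed = []
    waits = []      # private-token count accumulated before each disclosure
    pending = 0     # private tokens since the last disclosure
    for kind, body in SEGMENT.findall(stream):
        n_tokens = len(body.split())  # assumption: whitespace words as tokens
        if kind == "think":
            pending += n_tokens
        else:
            disclosed.append(body.strip())
            waits.append(pending)
            pending = 0
    return PacingStats(
        disclosed_text=" ".join(disclosed),
        # If nothing was ever disclosed, report the undisclosed private run.
        first_content_wait=waits[0] if waits else pending,
        max_inter_update_wait=max(waits, default=pending),
    )


if __name__ == "__main__":
    demo = (
        "<think>a b c</think><say>Step one.</say>"
        "<think>d e f g h</think><say>Answer: 7.</say>"
    )
    print(pacing_stats(demo))
```

On the demo stream, the disclosed text is "Step one. Answer: 7.", the wait before the first content is 3 proxy tokens, and the longest wait between updates is 5. A think-then-answer baseline would correspond to a single long private run followed by one disclosure, which is the "silence tax" the abstract describes.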

Similar Articles

Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs

arXiv cs.AI

This paper introduces Reasoning Exposure Prompting (REP), a method that uses shadow-model demonstrations in code-like formats to elicit hidden reasoning traces from LLMs, showing that interface-level trace hiding is insufficient to prevent extraction of useful reasoning signals.

When to Think Deeply: Inhibitory Deliberation for LLM Reasoning

arXiv cs.CL

IDPR is a framework for response-conditioned inhibitory deliberation that first generates a fast intuitive answer, then uses an inhibition controller to decide whether to invoke slow reasoning, achieving efficiency gains while maintaining accuracy.

Multi-Turn Reasoning When Context Arrives in Pieces: Scalable Sharding and Memory-Augmented RL

arXiv cs.CL

This paper addresses the 'Lost in Conversation' problem where LLMs struggle with information revealed across multiple turns. It proposes a scalable sharding pipeline to create multi-turn training data from single-turn QA datasets and uses reinforcement learning with verifiable rewards to train a memory-augmented policy that maintains a compact rolling memory, improving multi-turn reasoning accuracy and generalizing zero-shot to harder tasks.

Learning to Refine Hidden States for Reliable LLM Reasoning

arXiv cs.LG

Proposes ReLAR, a reinforcement-guided latent refinement framework that iteratively updates hidden representations in LLMs before decoding, improving reasoning reliability and efficiency compared to chain-of-thought methods.