ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning
Summary
ThoughtFold is a framework that uses introspective preference learning to reduce redundant exploration in the Chain-of-Thought reasoning of Large Reasoning Models. It cuts the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy.
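The abstract describes a masked preference optimization objective but does not state its form. As a rough illustration only, the sketch below assumes a DPO-style loss in which a per-token mask selects which tokens of each sub-trajectory contribute to the preference margin. The function names, the `beta` value, the mask semantics, and the toy log-probabilities are all hypothetical and are not taken from the paper.

```python
import math

def masked_logprob(token_logps, mask):
    """Sum token log-probabilities where the mask entry is 1."""
    return sum(lp for lp, m in zip(token_logps, mask) if m)

def masked_preference_loss(pol_chosen, ref_chosen, mask_chosen,
                           pol_rejected, ref_rejected, mask_rejected,
                           beta=0.1):
    """DPO-style loss over masked tokens (an assumed form, not the paper's)."""
    chosen_margin = (masked_logprob(pol_chosen, mask_chosen)
                     - masked_logprob(ref_chosen, mask_chosen))
    rejected_margin = (masked_logprob(pol_rejected, mask_rejected)
                       - masked_logprob(ref_rejected, mask_rejected))
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) computed as a numerically stable softplus(-logits).
    z = -logits
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

# Toy per-token log-probabilities (made up for illustration).
ref_chosen = [-0.7, -0.4, -0.3]          # concise sub-trajectory, reference model
pol_chosen = [-0.5, -0.2, -0.1]          # same tokens, policy model
ref_rejected = [-0.7, -1.0, -1.0, -0.3]  # sub-trajectory with a redundant detour
pol_rejected = [-0.5, -2.0, -2.0, -0.1]
mask_chosen = [1, 1, 1]
mask_rejected = [0, 1, 1, 0]             # assumption: score only the detour tokens

baseline = masked_preference_loss(ref_chosen, ref_chosen, mask_chosen,
                                  ref_rejected, ref_rejected, mask_rejected)
trained = masked_preference_loss(pol_chosen, ref_chosen, mask_chosen,
                                 pol_rejected, ref_rejected, mask_rejected)
```

In this toy setup the loss equals log 2 when the policy matches the reference, and it falls below log 2 once the policy assigns lower probability to the masked detour tokens and higher probability to the concise path.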
# ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

Source: [https://arxiv.org/abs/2606.03503](https://arxiv.org/abs/2606.03503) [View PDF](https://arxiv.org/pdf/2606.03503)

> Abstract: Large Reasoning Models (LRMs) have achieved remarkable progress thanks to Reinforcement Learning with Verifiable Rewards (RLVR) on Chain-of-Thoughts (CoTs). However, since long CoTs naturally contain trial and errors and mainstream RLVR approaches choose outcome-correct CoT trajectories for memorization, the redundant explorations in long CoTs are inevitably reinforced, which results in the over-thinking issues of LRMs. Previous attempts to resolve this issue mainly give more advantage to shorter trajectories, yet their learning signals are still outcome-based and cannot reduce the memorization of redundant explorations in long CoTs. Therefore, we propose ThoughtFold, a framework that leverages fine-grained preference learning to mitigate redundant explorations for efficient reasoning. ThoughtFold employs an introspective strategy to identify redundancy within each correct trajectory, which yields a spectrum of candidate sub-trajectories. Leveraging this spectrum, we introduce a masked preference optimization objective that explicitly penalizes redundant explorations and encourages the model to directly bridge essential reasoning segments, effectively folding its reasoning chains into a more concise path. Extensive experiments show that ThoughtFold significantly enhances efficiency. It reduces the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy.

## Submission history

From: Ziyan Liu

**[v1]** Tue, 2 Jun 2026 11:21:27 UTC (692 KB)
Similar Articles
Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts
This research paper from MediaTek and National Taiwan University challenges the assumption that reasoning chains must be dense and sequential, showing that models can extract answers from sparse, shuffled, and noisy reasoning traces. The findings suggest that answer extraction is robust and order-independent, potentially enabling more efficient, parallelized reasoning generation.
Improving Reasoning Capabilities in Small Models through Mixture-of-Layers Distillation with Stepwise Attention on Key Information
This paper proposes a novel Chain-of-Thought distillation framework that transfers teacher models' stepwise attention on key information to student models through a Mixture-of-Layers module for dynamic layer alignment. The method achieves consistent performance improvements on mathematical and commonsense reasoning benchmarks by explicitly guiding student models to progressively focus on critical information during reasoning.
Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning
Proposes ProxyCoT, a training framework that improves long-context reasoning in large language models by first obtaining chain-of-thought reasoning traces on short proxy contexts (via reinforcement learning or distillation) and then grounding them in full long contexts through supervised fine-tuning. Experiments show consistent improvements over baselines with reduced computational cost.
Thoughts-as-Planning: Latent World Models for Chain-of-Thoughts Optimization via Reinforcement Planning
Introduces Thoughts-as-Planning, a framework that models chain-of-thought optimization as sequential decision-making using latent world models and reinforcement learning, outperforming existing methods in efficiency and generalization.
Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal
This paper investigates how chain-of-thought reasoning in large reasoning models complicates activation-based steering of refusal behavior. Experiments on DeepSeek-R1-Distill-LLaMA-8B show that refusal is jointly encoded in residual stream activations and the CoT trace, making models more robust to activation-level interventions but exposing the CoT as an alternative attack surface.