ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

arXiv cs.AI Papers

Summary

ThoughtFold is a framework that uses introspective preference learning to reduce redundant exploration in the Chain-of-Thought reasoning of Large Reasoning Models. It cuts the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy.

arXiv:2606.03503v1, Announce Type: new

Abstract: Large Reasoning Models (LRMs) have achieved remarkable progress thanks to Reinforcement Learning with Verifiable Rewards (RLVR) on Chain-of-Thoughts (CoTs). However, since long CoTs naturally contain trial and errors and mainstream RLVR approaches choose outcome-correct CoT trajectories for memorization, the redundant explorations in long CoTs are inevitably reinforced, which results in the over-thinking issues of LRMs. Previous attempts to resolve this issue mainly give more advantage to shorter trajectories, yet their learning signals are still outcome-based and cannot reduce the memorization of redundant explorations in long CoTs. Therefore, we propose ThoughtFold, a framework that leverages fine-grained preference learning to mitigate redundant explorations for efficient reasoning. ThoughtFold employs an introspective strategy to identify redundancy within each correct trajectory, which yields a spectrum of candidate sub-trajectories. Leveraging this spectrum, we introduce a masked preference optimization objective that explicitly penalizes redundant explorations and encourages the model to directly bridge essential reasoning segments, effectively folding its reasoning chains into a more concise path. Extensive experiments show that ThoughtFold significantly enhances efficiency. It reduces the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy.


# ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning
Source: [https://arxiv.org/abs/2606.03503](https://arxiv.org/abs/2606.03503)
[View PDF](https://arxiv.org/pdf/2606.03503)

> Abstract: Large Reasoning Models (LRMs) have achieved remarkable progress thanks to Reinforcement Learning with Verifiable Rewards (RLVR) on Chain-of-Thoughts (CoTs). However, since long CoTs naturally contain trial and errors and mainstream RLVR approaches choose outcome-correct CoT trajectories for memorization, the redundant explorations in long CoTs are inevitably reinforced, which results in the over-thinking issues of LRMs. Previous attempts to resolve this issue mainly give more advantage to shorter trajectories, yet their learning signals are still outcome-based and cannot reduce the memorization of redundant explorations in long CoTs. Therefore, we propose ThoughtFold, a framework that leverages fine-grained preference learning to mitigate redundant explorations for efficient reasoning. ThoughtFold employs an introspective strategy to identify redundancy within each correct trajectory, which yields a spectrum of candidate sub-trajectories. Leveraging this spectrum, we introduce a masked preference optimization objective that explicitly penalizes redundant explorations and encourages the model to directly bridge essential reasoning segments, effectively folding its reasoning chains into a more concise path. Extensive experiments show that ThoughtFold significantly enhances efficiency. It reduces the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy.
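
The abstract names a "masked preference optimization objective" but does not give its form. As a rough illustration only, the sketch below shows one way a token mask could be combined with a generic DPO-style preference loss, so that only selected tokens contribute to the comparison between a concise ("folded") trajectory and a longer redundant one. The function names, the `beta` parameter, the mask layout, and the choice of a DPO-style loss are all assumptions made for this example, not details taken from the paper.

```python
import math
from typing import Sequence, Tuple

# Hypothetical container: (policy log-probs, reference log-probs, 0/1 mask) per token.
Trajectory = Tuple[Sequence[float], Sequence[float], Sequence[int]]


def masked_logratio(policy_logps: Sequence[float],
                    ref_logps: Sequence[float],
                    mask: Sequence[int]) -> float:
    """Sum of (policy - reference) log-probabilities over tokens where mask is 1."""
    if not (len(policy_logps) == len(ref_logps) == len(mask)):
        raise ValueError("log-prob lists and mask must have the same length")
    return sum(m * (p - r) for p, r, m in zip(policy_logps, ref_logps, mask))


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def masked_preference_loss(chosen: Trajectory,
                           rejected: Trajectory,
                           beta: float = 0.1) -> float:
    """Generic DPO-style loss, -log(sigmoid(margin)), restricted to masked tokens.

    Assumption: `chosen` is the concise trajectory and `rejected` is the
    redundant one; the mask marks which tokens take part in the comparison.
    """
    margin = beta * (masked_logratio(*chosen) - masked_logratio(*rejected))
    return softplus(-margin)  # identical to -log(sigmoid(margin))


# Toy inputs: the policy favours the chosen tokens more than the reference does,
# and disfavours the single masked-in token of the rejected trajectory.
chosen = ([-1.0, -0.5], [-1.0, -1.0], [1, 1])
rejected = ([-0.2, -3.0, -0.4], [-0.2, -1.0, -0.4], [0, 1, 0])
loss = masked_preference_loss(chosen, rejected)

# Neutral case: policy equals reference everywhere, so the margin is zero.
neutral = ([-1.0, -1.0], [-1.0, -1.0], [1, 1])
neutral_loss = masked_preference_loss(neutral, neutral)
```

In this toy setup the neutral case gives a loss of log 2, and the loss falls below that once the policy prefers the chosen trajectory on the masked tokens. Tokens with a mask value of 0 have no effect on the loss, which is the property a mask is meant to provide here.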

## Submission history

From: Ziyan Liu \[[view email](https://arxiv.org/show-email/ea6547b1/2606.03503)\]

**\[v1\]** Tue, 2 Jun 2026 11:21:27 UTC (692 KB)

Similar Articles

Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts

arXiv cs.CL

This research paper from MediaTek and National Taiwan University challenges the assumption that reasoning chains must be dense and sequential, showing that models can extract answers from sparse, shuffled, and noisy reasoning traces. The findings suggest that answer extraction is robust and order-independent, potentially enabling more efficient, parallelized reasoning generation.

Improving Reasoning Capabilities in Small Models through Mixture-of-Layers Distillation with Stepwise Attention on Key Information

arXiv cs.CL

This paper proposes a novel Chain-of-Thought distillation framework that transfers teacher models' stepwise attention on key information to student models through a Mixture-of-Layers module for dynamic layer alignment. The method achieves consistent performance improvements on mathematical and commonsense reasoning benchmarks by explicitly guiding student models to progressively focus on critical information during reasoning.

Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning

arXiv cs.CL

Proposes ProxyCoT, a training framework that improves long-context reasoning in large language models by first obtaining chain-of-thought reasoning traces on short proxy contexts (via reinforcement learning or distillation) and then grounding them in full long contexts through supervised fine-tuning. Experiments show consistent improvements over baselines with reduced computational cost.

Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal

arXiv cs.AI

This paper investigates how chain-of-thought reasoning in large reasoning models complicates activation-based steering of refusal behavior. Experiments on DeepSeek-R1-Distill-LLaMA-8B show that refusal is jointly encoded in residual stream activations and the CoT trace, making models more robust to activation-level interventions but exposing the CoT as an alternative attack surface.