ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

arXiv cs.AI 06/03/26, 04:00 AM Papers

Summary

ThoughtFold is a framework that uses introspective preference learning to reduce redundant explorations in the Chain-of-Thought reasoning of Large Reasoning Models. It cuts the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy.

arXiv:2606.03503v1. Abstract: Large Reasoning Models (LRMs) have achieved remarkable progress thanks to Reinforcement Learning with Verifiable Rewards (RLVR) on Chain-of-Thoughts (CoTs). However, since long CoTs naturally contain trial and errors and mainstream RLVR approaches choose outcome-correct CoT trajectories for memorization, the redundant explorations in long CoTs are inevitably reinforced, which results in the over-thinking issues of LRMs. Previous attempts to resolve this issue mainly give more advantage to shorter trajectories, yet their learning signals are still outcome-based and cannot reduce the memorization of redundant explorations in long CoTs. Therefore, we propose ThoughtFold, a framework that leverages fine-grained preference learning to mitigate redundant explorations for efficient reasoning. ThoughtFold employs an introspective strategy to identify redundancy within each correct trajectory, which yields a spectrum of candidate sub-trajectories. Leveraging this spectrum, we introduce a masked preference optimization objective that explicitly penalizes redundant explorations and encourages the model to directly bridge essential reasoning segments, effectively folding its reasoning chains into a more concise path. Extensive experiments show that ThoughtFold significantly enhances efficiency. It reduces the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy.


Cached at: 06/03/26, 09:44 AM

# ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning
Source: [https://arxiv.org/abs/2606.03503](https://arxiv.org/abs/2606.03503)
[View PDF](https://arxiv.org/pdf/2606.03503)

> Abstract: Large Reasoning Models (LRMs) have achieved remarkable progress thanks to Reinforcement Learning with Verifiable Rewards (RLVR) on Chain-of-Thoughts (CoTs). However, since long CoTs naturally contain trial and errors and mainstream RLVR approaches choose outcome-correct CoT trajectories for memorization, the redundant explorations in long CoTs are inevitably reinforced, which results in the over-thinking issues of LRMs. Previous attempts to resolve this issue mainly give more advantage to shorter trajectories, yet their learning signals are still outcome-based and cannot reduce the memorization of redundant explorations in long CoTs. Therefore, we propose ThoughtFold, a framework that leverages fine-grained preference learning to mitigate redundant explorations for efficient reasoning. ThoughtFold employs an introspective strategy to identify redundancy within each correct trajectory, which yields a spectrum of candidate sub-trajectories. Leveraging this spectrum, we introduce a masked preference optimization objective that explicitly penalizes redundant explorations and encourages the model to directly bridge essential reasoning segments, effectively folding its reasoning chains into a more concise path. Extensive experiments show that ThoughtFold significantly enhances efficiency. It reduces the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy.
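
The abstract does not spell out the masked preference optimization objective, so the following is only a minimal sketch of what a token-masked, DPO-style preference loss could look like. Everything in it is an assumption made for illustration: the function names `masked_logratio` and `masked_preference_loss`, the `beta` temperature, the use of a frozen reference model, and the convention that a mask value of 1 marks a token that contributes to the loss. It is not the authors' formulation.

```python
import math


def masked_logratio(policy_logps, ref_logps, mask):
    """Sum of (policy - reference) token log-probs where mask is 1.

    Hypothetical helper: tokens with mask 0 (for example, segments shared
    by both trajectories) do not contribute to the preference signal.
    """
    return sum(m * (p - r) for p, r, m in zip(policy_logps, ref_logps, mask))


def masked_preference_loss(chosen, rejected, beta=0.1):
    """DPO-style loss -log(sigmoid(margin)) with per-token masks.

    `chosen` and `rejected` are (policy_logps, ref_logps, mask) triples.
    In this sketch, `chosen` stands for a concise sub-trajectory and
    `rejected` for one that contains a redundant exploration.
    """
    margin = beta * (masked_logratio(*chosen) - masked_logratio(*rejected))
    # Numerically stable form of -log(sigmoid(margin)).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy inputs (made-up numbers, not taken from the paper).
chosen = ([-1.0, -0.5], [-1.0, -1.0], [1, 1])
rejected = ([-0.2, -0.3, -0.4], [-0.2, -0.3, -0.1], [0, 0, 1])
loss = masked_preference_loss(chosen, rejected)
```

In this toy case the policy already favors the concise trajectory on the unmasked tokens, so the loss falls below `math.log(2)`, the value obtained when the two masked log-ratios are equal.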

## Submission history

From: Ziyan Liu
**[v1]** Tue, 2 Jun 2026 11:21:27 UTC (692 KB)


Similar Articles

TabRank: Chain-of-Thought Distillation for Table Re-Rankers

Revisiting Chain-of-Thought Reasoning under Limited Supervision: Semi-supervised Chain-of-Thought Learning

Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts

Reasoning Fine-Tuning Induces Persistent Latent Policy States

Look Light, Think Heavy: What Multimodal Chain-of-Thought Reasoning Can and Cannot Do
