ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning
June 2, 2026
Authors: Ziyan Liu, Xueda Shen, Yuzhe Gu, Songyang Gao, Kuikun Liu, Guangran Cheng, Chengqi Lyu, Dahua Lin, Wenwei Zhang, Kai Chen
cs.AI
Abstract
Large Reasoning Models (LRMs) have made remarkable progress thanks to Reinforcement Learning with Verifiable Rewards (RLVR) on chains of thought (CoTs). However, long CoTs naturally contain trial and error, and mainstream RLVR approaches select outcome-correct CoT trajectories for memorization, so the redundant explorations in long CoTs are inevitably reinforced, which leads to the overthinking problem of LRMs. Previous attempts to resolve this issue mainly assign greater advantage to shorter trajectories, yet their learning signals remain outcome-based and cannot reduce the memorization of redundant explorations in long CoTs. We therefore propose ThoughtFold, a framework that leverages fine-grained preference learning to mitigate redundant explorations for efficient reasoning. ThoughtFold employs an introspective strategy to identify redundancy within each correct trajectory, which yields a spectrum of candidate sub-trajectories. Leveraging this spectrum, we introduce a masked preference optimization objective that explicitly penalizes redundant explorations and encourages the model to directly bridge essential reasoning segments, effectively folding its reasoning chains into a more concise path. Extensive experiments show that ThoughtFold significantly improves efficiency: it reduces the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy.
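The abstract does not spell out the masked preference optimization objective, so the following is only an illustrative sketch of what a masked, DPO-style pairwise loss could look like, not the authors' formulation. It assumes a "chosen" (more concise) and a "rejected" (more redundant) sub-trajectory, per-token log-probabilities from a policy and a frozen reference model, and a 0/1 mask selecting which tokens contribute to the loss. The function names, the mask semantics, and the `beta` parameter are all hypothetical.

```python
import math


def masked_logprob(token_logps, mask):
    """Sum per-token log-probabilities over positions where mask is 1."""
    if len(token_logps) != len(mask):
        raise ValueError("token_logps and mask must have the same length")
    return sum(lp for lp, m in zip(token_logps, mask) if m)


def masked_preference_loss(policy_chosen, ref_chosen, chosen_mask,
                           policy_rejected, ref_rejected, rejected_mask,
                           beta=0.1):
    """Hypothetical DPO-style loss restricted to masked tokens.

    Computes -log(sigmoid(beta * (chosen_margin - rejected_margin))), where
    each margin is the masked policy log-prob minus the masked reference
    log-prob. This is an assumed form, not taken from the paper.
    """
    chosen_margin = (masked_logprob(policy_chosen, chosen_mask)
                     - masked_logprob(ref_chosen, chosen_mask))
    rejected_margin = (masked_logprob(policy_rejected, rejected_mask)
                       - masked_logprob(ref_rejected, rejected_mask))
    z = beta * (chosen_margin - rejected_margin)
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Toy example (made-up numbers): the first token is a shared prefix and is
# masked out, so only the diverging tokens contribute to the loss.
mask = [0, 1, 1]
loss = masked_preference_loss(
    policy_chosen=[-0.5, -0.2, -0.1], ref_chosen=[-0.5, -0.4, -0.3],
    chosen_mask=mask,
    policy_rejected=[-0.5, -1.0, -1.2], ref_rejected=[-0.5, -0.8, -0.9],
    rejected_mask=mask,
)
print(loss)
```

In this sketch the loss equals log 2 when the policy matches the reference, and falls below that value as the policy shifts probability toward the concise sub-trajectory and away from the redundant one on the unmasked tokens.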