ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning
June 2, 2026
Authors: Ziyan Liu, Xueda Shen, Yuzhe Gu, Songyang Gao, Kuikun Liu, Guangran Cheng, Chengqi Lyu, Dahua Lin, Wenwei Zhang, Kai Chen
cs.AI
Abstract
Large Reasoning Models (LRMs) have achieved remarkable progress thanks to Reinforcement Learning with Verifiable Rewards (RLVR) on Chains of Thought (CoTs). However, because long CoTs naturally contain trial and error, and mainstream RLVR approaches select outcome-correct CoT trajectories for memorization, the redundant explorations in long CoTs are inevitably reinforced, which leads to overthinking in LRMs. Previous attempts to resolve this issue mainly assign greater advantage to shorter trajectories, yet their learning signals remain outcome-based and cannot reduce the memorization of redundant explorations in long CoTs. We therefore propose ThoughtFold, a framework that uses fine-grained preference learning to mitigate redundant explorations for efficient reasoning. ThoughtFold employs an introspective strategy to identify redundancy within each correct trajectory, yielding a spectrum of candidate sub-trajectories. Leveraging this spectrum, we introduce a masked preference optimization objective that explicitly penalizes redundant explorations and encourages the model to bridge essential reasoning segments directly, effectively folding its reasoning chains into a more concise path. Extensive experiments show that ThoughtFold significantly improves efficiency: it reduces the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy.
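The abstract does not give the form of the masked preference optimization objective, so the following is only a minimal sketch of one plausible reading: a DPO-style pairwise loss in which a per-token mask restricts the comparison to the tokens where a concise ("folded") sub-trajectory and the original redundant trajectory differ. The function names, the `(policy_logps, ref_logps, mask)` tuple layout, the use of a reference model, and the `beta` value are all assumptions made for illustration and are not taken from the paper.

```python
import math

def masked_logratio(policy_logps, ref_logps, mask):
    """Sum of (policy - reference) token log-probs where mask == 1.

    Hypothetical helper: the mask zeroes out tokens that should not
    contribute to the preference signal (e.g. a shared prefix).
    """
    return sum(m * (p - r) for p, r, m in zip(policy_logps, ref_logps, mask))

def masked_preference_loss(chosen, rejected, beta=0.1):
    """DPO-style loss, -log(sigmoid(margin)), over masked tokens only.

    `chosen` and `rejected` are (policy_logps, ref_logps, mask) tuples.
    This is an assumed formulation, not the paper's stated objective.
    """
    margin = beta * (masked_logratio(*chosen) - masked_logratio(*rejected))
    # Numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Toy per-token log-probs (made-up numbers). The first token is a shared
# prefix and is masked out in both trajectories.
folded = ([-0.2, -0.3, -0.1], [-0.4, -0.5, -0.3], [0, 1, 1])
redundant = ([-0.2, -0.9, -0.8, -0.1], [-0.4, -0.6, -0.5, -0.3], [0, 1, 1, 1])

loss = masked_preference_loss(folded, redundant)
```

In this toy case the policy already favors the folded trajectory on the unmasked tokens, so the loss falls below `log(2)`, the value obtained when the two masked log-ratios are equal. Minimizing such a loss would push probability mass away from the redundant segment and toward the direct continuation, which matches the abstract's description of penalizing redundant explorations, though the actual objective may differ.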