ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

Abstract

Large Reasoning Models (LRMs) have made remarkable progress through Reinforcement Learning with Verifiable Rewards (RLVR) applied to Chains of Thought (CoTs). However, long CoTs naturally contain trial and error, and mainstream RLVR approaches select outcome-correct CoT trajectories for memorization, so the redundant explorations in long CoTs are inevitably reinforced. This leads to the overthinking problem of LRMs. Previous attempts to resolve this issue mainly assign higher advantage to shorter trajectories, yet their learning signals remain outcome-based and cannot reduce the memorization of redundant explorations in long CoTs. We therefore propose ThoughtFold, a framework that leverages fine-grained preference learning to mitigate redundant explorations for efficient reasoning. ThoughtFold employs an introspective strategy to identify redundancy within each correct trajectory, which yields a spectrum of candidate sub-trajectories. Leveraging this spectrum, we introduce a masked preference optimization objective that explicitly penalizes redundant explorations and encourages the model to directly bridge essential reasoning segments, effectively folding its reasoning chains into a more concise path. Extensive experiments show that ThoughtFold significantly improves efficiency: it reduces the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy.
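
The abstract does not state the form of the masked preference optimization objective. As a rough illustration only, the sketch below assumes a DPO-style pairwise loss in which a 0/1 token mask restricts the log-probability ratio to the tokens where a folded (preferred) trajectory and its redundant (dispreferred) counterpart differ. The function names, the `beta` parameter, the masking scheme, and the toy log-probabilities are all assumptions made for illustration, not the authors' formulation.

```python
import math

def masked_logratio(policy_logps, ref_logps, mask):
    """Sum of (policy - reference) token log-probs over positions where mask == 1."""
    return sum(m * (p - r) for p, r, m in zip(policy_logps, ref_logps, mask))

def masked_preference_loss(chosen, rejected, beta=0.1):
    """Assumed DPO-style loss, -log(sigmoid(margin)), computed on masked tokens only.

    `chosen` and `rejected` are (policy_logps, ref_logps, mask) triples of
    equal-length lists; `beta` is a hypothetical temperature.
    """
    margin = beta * (masked_logratio(*chosen) - masked_logratio(*rejected))
    # -log(sigmoid(x)) == log(1 + exp(-x)), evaluated in a numerically stable way
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Toy pair (made-up numbers): both trajectories share a two-token prefix,
# which is masked out. The folded trajectory then bridges directly to the
# next essential step; the original continues with a redundant exploration.
folded = ([-0.2, -0.3, -0.5], [-0.2, -0.3, -0.9], [0, 0, 1])
redundant = ([-0.2, -0.3, -0.4, -0.6], [-0.2, -0.3, -0.3, -0.4], [0, 0, 1, 1])

loss = masked_preference_loss(folded, redundant)
```

Under these assumptions, the loss falls below `log 2` whenever the policy, relative to the reference, favors the bridging tokens over the redundant ones, and the masked shared prefix contributes nothing to the loss.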