ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

Abstract

Large Reasoning Models (LRMs) have achieved remarkable progress through Reinforcement Learning with Verifiable Rewards (RLVR) on Chains of Thought (CoTs). However, long CoTs naturally contain trial and error, and mainstream RLVR approaches select outcome-correct CoT trajectories for memorization. As a result, the redundant explorations in long CoTs are inevitably reinforced, which leads to overthinking in LRMs. Previous attempts to resolve this issue mainly favor shorter trajectories, yet their learning signals remain outcome-based and cannot reduce the memorization of redundant explorations within long CoTs. We therefore propose ThoughtFold, a framework that leverages fine-grained preference learning to mitigate redundant explorations for efficient reasoning. ThoughtFold employs an introspective strategy to identify redundancy within each correct trajectory, which yields a spectrum of candidate sub-trajectories. Leveraging this spectrum, we introduce a masked preference optimization objective that explicitly penalizes redundant explorations and encourages the model to bridge essential reasoning segments directly, effectively folding its reasoning chains into a more concise path. Extensive experiments show that ThoughtFold significantly improves efficiency: it reduces the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy.
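
The abstract does not spell out the masked preference optimization objective, so the following is only an illustrative sketch of what a masked, pairwise preference loss could look like. It assumes a generic DPO-style formulation over a concise sub-trajectory (preferred) and a trajectory containing redundant exploration (dispreferred), with a binary token mask selecting which tokens contribute. The function names, the `beta` parameter, the use of a reference model, and the masking scheme are all hypothetical and are not taken from the paper.

```python
import math


def masked_logprob(token_logprobs, mask):
    """Sum per-token log-probabilities over positions where mask is 1."""
    if len(token_logprobs) != len(mask):
        raise ValueError("token_logprobs and mask must have equal length")
    return sum(lp for lp, m in zip(token_logprobs, mask) if m)


def masked_preference_loss(policy_chosen, ref_chosen, chosen_mask,
                           policy_rejected, ref_rejected, rejected_mask,
                           beta=0.1):
    """Hypothetical DPO-style loss restricted to masked tokens.

    "chosen" is assumed to be the concise sub-trajectory and "rejected"
    the trajectory with redundant exploration; each argument is a list of
    per-token log-probabilities (policy or reference model) or a 0/1 mask.
    """
    # Log-ratio of policy to reference, counted only on unmasked tokens.
    chosen_margin = (masked_logprob(policy_chosen, chosen_mask)
                     - masked_logprob(ref_chosen, chosen_mask))
    rejected_margin = (masked_logprob(policy_rejected, rejected_mask)
                       - masked_logprob(ref_rejected, rejected_mask))
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) = log(1 + exp(-z)), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

Under these assumptions, the loss falls below log 2 once the policy assigns relatively more probability than the reference to the concise path and relatively less to the redundant tokens. Tokens with mask value 0 (for example, segments shared by both trajectories) have no effect on the loss.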