ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

Abstract

Large Reasoning Models (LRMs) have made remarkable progress through Reinforcement Learning with Verifiable Rewards (RLVR) applied to Chains of Thought (CoTs). However, long CoTs naturally contain trial and error, and mainstream RLVR approaches select outcome-correct CoT trajectories for the model to memorize. As a result, the redundant explorations in long CoTs are inevitably reinforced, which leads to overthinking in LRMs. Previous attempts to address this issue mainly assign a larger advantage to shorter trajectories, but their learning signals remain outcome-based and therefore cannot reduce the memorization of redundant explorations within long CoTs. We therefore propose ThoughtFold, a framework that uses fine-grained preference learning to mitigate redundant explorations for efficient reasoning. ThoughtFold employs an introspective strategy to identify redundancy within each correct trajectory, yielding a spectrum of candidate sub-trajectories. Leveraging this spectrum, we introduce a masked preference optimization objective that explicitly penalizes redundant explorations and encourages the model to bridge essential reasoning segments directly, effectively folding its reasoning chains into a more concise path. Extensive experiments show that ThoughtFold significantly improves efficiency: it reduces the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy.
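
The abstract does not give the form of the masked preference optimization objective, so the following is only an illustrative sketch of what such an objective could look like, not the ThoughtFold loss itself. It assumes a DPO-style pairwise loss in which a per-token mask decides which tokens contribute to the preference margin. The function names, the `beta` parameter, the masking scheme, and the toy log-probabilities are all hypothetical.

```python
import math


def masked_sequence_logratio(policy_logps, ref_logps, mask):
    """Sum of per-token log-ratios (policy minus reference) over masked-in tokens."""
    return sum(m * (p - r) for p, r, m in zip(policy_logps, ref_logps, mask))


def masked_preference_loss(chosen, rejected, beta=0.1):
    """DPO-style pairwise loss (an assumption, not the paper's stated objective).

    Each argument is a tuple (policy_logps, ref_logps, mask) of equal-length lists.
    """
    margin = beta * (
        masked_sequence_logratio(*chosen) - masked_sequence_logratio(*rejected)
    )
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical toy pair: a concise sub-trajectory (chosen) versus the same
# reasoning with one redundant exploration token in the middle (rejected).
# The rejected mask keeps only the redundant token, so the shared tokens are
# not penalized.
chosen = ([-0.5, -0.4], [-0.6, -0.6], [1, 1])
rejected = ([-0.5, -2.0, -0.4], [-0.6, -1.0, -0.6], [0, 1, 0])
loss = masked_preference_loss(chosen, rejected)
```

In this toy setup the policy already assigns the redundant token a lower probability than the reference does, so the loss falls below the zero-margin value of log 2. Masking out the shared tokens is one possible design choice for concentrating the penalty on the redundant segment.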