Thinking Preference Optimization
February 17, 2025
Authors: Wang Yang, Hongye Jin, Jingfeng Yang, Vipin Chaudhary, Xiaotian Han
cs.AI
Abstract
Supervised Fine-Tuning (SFT) has been a widely used and effective method for
enhancing long chain-of-thought (CoT) reasoning in relatively small LLMs by
fine-tuning them with long CoT responses from larger LLMs. To continually
improve reasoning abilities, we can either collect new high-quality long CoT
reasoning SFT data or repeatedly train on existing SFT datasets. However,
acquiring new long CoT SFT data is costly and limited, while repeated training
often results in a performance plateau or decline. To further improve
performance using existing SFT data, we propose Thinking Preference Optimization
(ThinkPO), a simple yet effective post-SFT method that enhances long CoT
reasoning without requiring new long CoT responses. Instead, ThinkPO utilizes
readily available or easily obtainable short CoT reasoning responses as
rejected answers and long CoT responses as chosen answers for the same
question. It then applies direct preference optimization to encourage the model
to favor longer reasoning outputs. Experiments show that ThinkPO further
improves the reasoning performance of SFT models; for example, it increases
their math reasoning accuracy by 8.6% and output length by 25.9%.
Notably, ThinkPO can continually boost the performance of publicly
available distilled SFT models, e.g., raising the official
DeepSeek-R1-Distill-Qwen-7B's performance on MATH500 from 87.4% to 91.2%.
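
The recipe described above (long CoT responses as chosen answers, short CoT responses as rejected answers, followed by direct preference optimization) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `build_preference_pairs` and `dpo_loss`, the dictionary layout, and the value `beta=0.1` are assumptions made for illustration, and the loss shown is the standard per-example DPO objective computed on already-summed sequence log-probabilities.

```python
import math


def build_preference_pairs(long_cot, short_cot):
    """Pair each question's long CoT (chosen) with its short CoT (rejected).

    Both arguments are hypothetical dicts mapping question -> response text.
    """
    pairs = []
    for question, long_answer in long_cot.items():
        short_answer = short_cot.get(question)
        if short_answer is None:
            continue  # skip questions that have no short CoT response
        pairs.append(
            {"prompt": question, "chosen": long_answer, "rejected": short_answer}
        )
    return pairs


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    Inputs are summed log-probabilities of each full response under the
    policy being trained and under the frozen reference (SFT) model.
    beta=0.1 is an assumed value, not one taken from the paper.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    x = beta * margin
    # Numerically stable evaluation of -log(sigmoid(x)) = log(1 + exp(-x)).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


if __name__ == "__main__":
    pairs = build_preference_pairs(
        {"What is 2 + 3?": "Let me think step by step... so the answer is 5."},
        {"What is 2 + 3?": "5."},
    )
    print(pairs[0]["chosen"])
    print(dpo_loss(-10.0, -20.0, -12.0, -15.0))
```

The loss falls below log 2 when the policy prefers the long response more strongly than the reference model does, and rises above it otherwise, which is what pushes the model toward longer reasoning outputs.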