

Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback

January 18, 2025
作者: Yen-Ting Lin, Di Jin, Tengyu Xu, Tianhao Wu, Sainbayar Sukhbaatar, Chen Zhu, Yun He, Yun-Nung Chen, Jason Weston, Yuandong Tian, Arash Rahnama, Sinong Wang, Hao Ma, Han Fang
cs.AI

Abstract
Large language models (LLMs) have recently demonstrated remarkable success in mathematical reasoning. Despite progress in methods like chain-of-thought prompting and self-consistency sampling, these advances often focus on final correctness without ensuring that the underlying reasoning process is coherent and reliable. This paper introduces Step-KTO, a training framework that combines process-level and outcome-level binary feedback to guide LLMs toward more trustworthy reasoning trajectories. By providing binary evaluations for both the intermediate reasoning steps and the final answer, Step-KTO encourages the model to adhere to logical progressions rather than relying on superficial shortcuts. Our experiments on challenging mathematical benchmarks show that Step-KTO significantly improves both final answer accuracy and the quality of intermediate reasoning steps. For example, on the MATH-500 dataset, Step-KTO achieves a notable improvement in Pass@1 accuracy over strong baselines. These results highlight the promise of integrating stepwise process feedback into LLM training, paving the way toward more interpretable and dependable reasoning capabilities.
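The abstract describes Step-KTO as combining process-level (per-step) and outcome-level (final-answer) binary feedback into a single training signal. The sketch below illustrates one way such a combination could look; the function name, the linear weighting scheme, and the `step_weight` parameter are assumptions for illustration, not the authors' actual implementation.

```python
# Illustrative sketch only: blend stepwise binary labels with the
# outcome label for one reasoning trajectory. The weighting scheme
# here is an assumption, not the method from the paper.

def combine_feedback(step_labels, final_correct, step_weight=0.5):
    """Blend process-level and outcome-level binary feedback.

    step_labels: list of bools, one per intermediate reasoning step
                 (True = step judged valid).
    final_correct: bool, whether the final answer is correct.
    Returns a scalar in [0, 1] usable as a training signal.
    """
    if not step_labels:
        # No intermediate steps: fall back to outcome-only feedback.
        return float(final_correct)
    process_score = sum(step_labels) / len(step_labels)
    outcome_score = float(final_correct)
    return step_weight * process_score + (1 - step_weight) * outcome_score
```

For example, a trajectory with three of four valid steps and a correct final answer would score `0.5 * 0.75 + 0.5 * 1.0 = 0.875`, rewarding correct answers more when the reasoning behind them is also sound.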

