Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback
January 18, 2025
Authors: Yen-Ting Lin, Di Jin, Tengyu Xu, Tianhao Wu, Sainbayar Sukhbaatar, Chen Zhu, Yun He, Yun-Nung Chen, Jason Weston, Yuandong Tian, Arash Rahnama, Sinong Wang, Hao Ma, Han Fang
cs.AI
Abstract
Large language models (LLMs) have recently demonstrated remarkable success in
mathematical reasoning. Despite progress in methods like chain-of-thought
prompting and self-consistency sampling, these advances often focus on final
correctness without ensuring that the underlying reasoning process is coherent
and reliable. This paper introduces Step-KTO, a training framework that
combines process-level and outcome-level binary feedback to guide LLMs toward
more trustworthy reasoning trajectories. By providing binary evaluations for
both the intermediate reasoning steps and the final answer, Step-KTO encourages
the model to adhere to logical progressions rather than relying on superficial
shortcuts. Our experiments on challenging mathematical benchmarks show that
Step-KTO significantly improves both final answer accuracy and the quality of
intermediate reasoning steps. For example, on the MATH-500 dataset, Step-KTO
achieves a notable improvement in Pass@1 accuracy over strong baselines. These
results highlight the promise of integrating stepwise process feedback into LLM
training, paving the way toward more interpretable and dependable reasoning
capabilities.
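The abstract does not spell out the training objective, but its central idea, binary (correct/incorrect) feedback applied both to each intermediate reasoning step and to the final answer, can be illustrated with a small sketch. The Python below is a hypothetical illustration only: the function name step_kto_style_loss, the per-step averaging, the equal weighting of the process and outcome terms, and the hyperparameters beta and ref_point are assumptions for the sake of the example, not the paper's actual formulation.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function used by the KTO-style value terms."""
    return 1.0 / (1.0 + math.exp(-x))


def step_kto_style_loss(step_log_ratios, step_labels,
                        outcome_log_ratio, outcome_label,
                        beta=0.1, ref_point=0.0):
    """Hypothetical sketch of a combined process- and outcome-level
    binary-feedback loss.

    step_log_ratios  -- log(pi_theta / pi_ref) for each intermediate step
    step_labels      -- 1 if a step is judged correct, 0 if judged incorrect
    outcome_log_ratio, outcome_label -- the same quantities for the final answer
    """
    def kto_value(log_ratio, label):
        # Correct (desirable) samples are pulled above the reference point;
        # incorrect (undesirable) samples are pushed below it.
        if label == 1:
            return 1.0 - sigmoid(beta * (log_ratio - ref_point))
        return 1.0 - sigmoid(beta * (ref_point - log_ratio))

    # Process-level term: average the per-step binary feedback.
    process = sum(kto_value(r, y) for r, y in zip(step_log_ratios, step_labels))
    process /= max(len(step_labels), 1)

    # Outcome-level term: binary feedback on the final answer.
    outcome = kto_value(outcome_log_ratio, outcome_label)

    # Equal weighting of the two signals is an assumption of this sketch.
    return process + outcome


if __name__ == "__main__":
    # Toy example: three reasoning steps, the second judged incorrect,
    # while the final answer is judged correct.
    loss = step_kto_style_loss(
        step_log_ratios=[0.8, -0.3, 0.5],
        step_labels=[1, 0, 1],
        outcome_log_ratio=0.6,
        outcome_label=1,
    )
    print(f"combined loss: {loss:.4f}")
```

In this toy example the second step is labeled incorrect while the final answer is labeled correct, so the process term still penalizes the flawed intermediate step even though the outcome term rewards the answer, which is the behavior the abstract attributes to stepwise feedback.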