

Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback

January 18, 2025
作者: Yen-Ting Lin, Di Jin, Tengyu Xu, Tianhao Wu, Sainbayar Sukhbaatar, Chen Zhu, Yun He, Yun-Nung Chen, Jason Weston, Yuandong Tian, Arash Rahnama, Sinong Wang, Hao Ma, Han Fang
cs.AI

Abstract
Large language models (LLMs) have recently demonstrated remarkable success in mathematical reasoning. Despite progress in methods like chain-of-thought prompting and self-consistency sampling, these advances often focus on final correctness without ensuring that the underlying reasoning process is coherent and reliable. This paper introduces Step-KTO, a training framework that combines process-level and outcome-level binary feedback to guide LLMs toward more trustworthy reasoning trajectories. By providing binary evaluations for both the intermediate reasoning steps and the final answer, Step-KTO encourages the model to adhere to logical progressions rather than relying on superficial shortcuts. Our experiments on challenging mathematical benchmarks show that Step-KTO significantly improves both final answer accuracy and the quality of intermediate reasoning steps. For example, on the MATH-500 dataset, Step-KTO achieves a notable improvement in Pass@1 accuracy over strong baselines. These results highlight the promise of integrating stepwise process feedback into LLM training, paving the way toward more interpretable and dependable reasoning capabilities.
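The abstract describes Step-KTO as combining process-level (per-step) and outcome-level (final-answer) binary feedback into a single training signal. The sketch below illustrates one way such a combination could look; the function name, the linear weighting scheme, and the `step_weight` parameter are assumptions for illustration, not the authors' actual implementation.

```python
# Illustrative sketch only: blend stepwise binary labels with the
# outcome label for one reasoning trajectory. The weighting scheme
# here is an assumption, not the method from the paper.

def combine_feedback(step_labels, final_correct, step_weight=0.5):
    """Blend process-level and outcome-level binary feedback.

    step_labels: list of bools, one per intermediate reasoning step
                 (True = step judged valid).
    final_correct: bool, whether the final answer is correct.
    Returns a scalar in [0, 1] usable as a training signal.
    """
    if not step_labels:
        # No intermediate steps: fall back to outcome-only feedback.
        return float(final_correct)
    process_score = sum(step_labels) / len(step_labels)
    outcome_score = float(final_correct)
    return step_weight * process_score + (1 - step_weight) * outcome_score
```

For example, a trajectory with three of four valid steps and a correct final answer would score `0.5 * 0.75 + 0.5 * 1.0 = 0.875`, rewarding correct answers more when the reasoning behind them is also sound.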

