Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback

Abstract

Large language models (LLMs) have recently demonstrated remarkable success in mathematical reasoning. Despite progress in methods like chain-of-thought prompting and self-consistency sampling, these advances often focus on final correctness without ensuring that the underlying reasoning process is coherent and reliable. This paper introduces Step-KTO, a training framework that combines process-level and outcome-level binary feedback to guide LLMs toward more trustworthy reasoning trajectories. By providing binary evaluations for both the intermediate reasoning steps and the final answer, Step-KTO encourages the model to follow logical progressions rather than rely on superficial shortcuts. Our experiments on challenging mathematical benchmarks show that Step-KTO significantly improves both final-answer accuracy and the quality of intermediate reasoning steps. For example, on the MATH-500 dataset, Step-KTO achieves a notable improvement in Pass@1 accuracy over strong baselines. These results highlight the promise of integrating stepwise process feedback into LLM training, paving the way toward more interpretable and dependable reasoning capabilities.
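
To make the two feedback levels concrete, the following is a minimal sketch of how one sampled solution could carry a binary label per reasoning step plus a binary label for the final answer, and how Pass@1 could be computed from the outcome labels. It is an illustration under stated assumptions, not the Step-KTO training objective: the names `LabeledTrajectory`, `is_fully_sound`, and `pass_at_1` are hypothetical, and the sketch assumes exactly one sampled solution per problem.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LabeledTrajectory:
    """One sampled solution with binary feedback (hypothetical structure)."""
    steps: List[str]         # intermediate reasoning steps
    step_labels: List[bool]  # process-level feedback, one flag per step
    outcome_label: bool      # outcome-level feedback on the final answer


def is_fully_sound(traj: LabeledTrajectory) -> bool:
    """True only when the final answer and every step are marked correct."""
    return traj.outcome_label and all(traj.step_labels)


def pass_at_1(trajectories: List[LabeledTrajectory]) -> float:
    """Fraction of problems whose single sampled answer is correct.

    Assumes one trajectory per problem; returns 0.0 for an empty list.
    """
    if not trajectories:
        return 0.0
    return sum(t.outcome_label for t in trajectories) / len(trajectories)


# Illustrative data only: a right answer reached through a flawed step
# counts toward Pass@1 but is not a fully sound trajectory.
sound = LabeledTrajectory(["2x = 6", "x = 3"], [True, True], True)
shortcut = LabeledTrajectory(["guess x = 3", "x = 3"], [False, True], True)
wrong = LabeledTrajectory(["2x = 6", "x = 4"], [True, False], False)
samples = [sound, shortcut, wrong]
```

The `shortcut` example is the case the abstract is concerned with: outcome-level feedback alone would treat it the same as `sound`, while the step labels tell the two apart.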