Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning

Abstract

Direct Preference Optimization (DPO) has proven effective at improving the performance of large language models (LLMs) on downstream tasks such as reasoning and alignment. In this work, we propose Step-Controlled DPO (SCDPO), a method that automatically provides stepwise error supervision by creating negative samples of mathematical reasoning rationales in which errors begin at a specified step. Applying these samples in DPO training better aligns the model to understand reasoning errors and to output accurate reasoning steps. We apply SCDPO to both code-integrated and chain-of-thought solutions and show empirically that it consistently improves performance over naive DPO on three different SFT models: one existing SFT model and two models we fine-tuned. Qualitative analysis of credit assignment under SCDPO and DPO demonstrates that SCDPO is effective at identifying errors in mathematical solutions. We then apply SCDPO to an InternLM2-20B model, producing a 20B model that scores 88.5% on GSM8K and 58.1% on MATH, rivaling other open-source LLMs and showing the potential of our method.
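For readers unfamiliar with the objective that SCDPO builds on, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is an illustration only and is not taken from this work: the function name `dpo_loss`, its parameter names, and the default `beta=0.1` are assumptions chosen for the example. In a step-controlled setting, the rejected sample would presumably be a rationale whose errors begin at a specified step, but how such samples are generated is not described in the abstract and is not shown here.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    All names here are hypothetical and chosen for illustration.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)

    # Loss is -log(sigmoid(logits)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Usage: the loss drops below log(2) once the policy prefers the chosen
# response more strongly than the reference model does.
print(dpo_loss(-1.0, -5.0, -2.0, -3.0, beta=1.0))
```

The loss depends only on the difference between the two implicit rewards, so the same function applies whether the rejected sample is an arbitrary wrong solution or one that diverges from a correct solution at a controlled step.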
