Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning

Abstract

Mathematical reasoning is a crucial capability for Large Language Models (LLMs), yet generating detailed and accurate reasoning traces remains a significant challenge. This paper introduces a novel approach to produce high-quality reasoning traces for LLM fine-tuning using online learning Flows. Our method employs an incremental output production Flow, where component LLMs collaboratively construct solutions through iterative communication. We train the Flow using online Direct Preference Optimization (DPO) learning with rollouts, generating DPO pairs for each training example and updating models in real time. We directly compare the quality of reasoning traces generated by our method with those produced through direct model inference, demonstrating the effectiveness of our approach in improving LLM performance in mathematical reasoning tasks.
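As a reading aid, the per-pair objective that DPO training minimizes can be sketched as follows. This is the standard DPO loss, not code from the paper: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for illustration, and the abstract does not specify how the rollouts select the chosen and rejected continuations.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or a frozen reference model.
    All names and the beta default are illustrative assumptions.
    """
    # How much more the policy prefers each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

In an online setting such as the one the abstract describes, a loss of this form would be evaluated on each freshly generated pair and followed by a model update; the loss falls below log 2 once the policy favors the chosen response more strongly than the reference does.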
