SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights

Abstract

Large language models (LLMs) such as GPT-4, PaLM, and LLaMA have shown significant improvements on a range of reasoning tasks. However, smaller models such as Llama-3-8B and DeepSeekMath-Base still struggle with complex mathematical reasoning because they fail to identify and correct their own reasoning errors effectively. Recent reflection-based methods try to address this by enabling self-reflection and self-correction, but the models still have difficulty detecting errors in their reasoning steps independently. To overcome these limitations, we propose SuperCorrect, a novel two-stage framework that uses a large teacher model to supervise and correct both the reasoning and the reflection processes of a smaller student model. In the first stage, we extract hierarchical thought templates, both high-level and detailed, from the teacher model to guide the student model toward more fine-grained reasoning thoughts. In the second stage, we introduce cross-model collaborative direct preference optimization (DPO), which improves the student model's self-correction ability by having it follow the teacher's correction traces during training. This cross-model DPO approach teaches the student model to locate and resolve erroneous thoughts using error-driven insights from the teacher model, so that it can move past the limits of its own reasoning and acquire new skills and knowledge for tackling challenging problems. Extensive experiments consistently show that SuperCorrect outperforms previous methods. Notably, our SuperCorrect-7B model surpasses the strong DeepSeekMath-7B by 7.8%/5.3% and Qwen2.5-Math-7B by 15.1%/6.3% on the MATH/GSM8K benchmarks, achieving new state-of-the-art (SOTA) performance among all 7B models. Code: https://github.com/YangLing0818/SuperCorrect-llm
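The second stage builds on direct preference optimization. As a rough illustration only, the sketch below computes the standard per-pair DPO loss in plain Python. It is not the authors' implementation: the function name `dpo_loss`, the `beta` default, and the choice to treat a teacher correction trace as the preferred response and the student's own flawed correction as the rejected one are assumptions made here for illustration.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, not taken from the abstract
) -> float:
    """Standard DPO loss for a single preference pair.

    Each argument is the summed token log-probability of a full response.
    Assumption for this sketch: "chosen" is a teacher correction trace and
    "rejected" is the student's own erroneous correction.
    """
    # How much more the policy favors each response than the frozen reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, loss is log(2).
    print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
    # Policy now prefers the chosen trace more than the reference does.
    print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The loss falls as the policy raises the likelihood of the preferred trace relative to the rejected one, measured against the reference model. That is the general mechanism a DPO-style stage would rely on, whatever the exact construction of the preference pairs.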
