SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights

October 11, 2024
Authors: Ling Yang, Zhaochen Yu, Tianjun Zhang, Minkai Xu, Joseph E. Gonzalez, Bin Cui, Shuicheng Yan
cs.AI

Abstract

Large language models (LLMs) such as GPT-4, PaLM, and LLaMA have shown significant improvements in various reasoning tasks. However, smaller models such as Llama-3-8B and DeepSeekMath-Base still struggle with complex mathematical reasoning because they fail to effectively identify and correct reasoning errors. Recent reflection-based methods aim to address these issues by enabling self-reflection and self-correction, but such models still have difficulty independently detecting errors in their own reasoning steps. To overcome these limitations, we propose SuperCorrect, a novel two-stage framework that uses a large teacher model to supervise and correct both the reasoning and the reflection processes of a smaller student model. In the first stage, we extract hierarchical high-level and detailed thought templates from the teacher model to guide the student model toward producing more fine-grained reasoning thoughts. In the second stage, we introduce cross-model collaborative direct preference optimization (DPO), which enhances the self-correction abilities of the student model by having it follow the teacher's correction traces during training. This cross-model DPO approach teaches the student model to effectively locate and resolve erroneous thoughts using error-driven insights from the teacher model, breaking the bottleneck of its own reasoning and acquiring new skills and knowledge for tackling challenging problems. Extensive experiments consistently demonstrate the superiority of our approach over previous methods. Notably, our SuperCorrect-7B model significantly surpasses the powerful DeepSeekMath-7B by 7.8%/5.3% and Qwen2.5-Math-7B by 15.1%/6.3% on the MATH/GSM8K benchmarks, achieving new SOTA performance among all 7B models. Code: https://github.com/YangLing0818/SuperCorrect-llm
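
To make the second stage more concrete, below is a minimal sketch of the standard DPO preference objective applied to a pair consisting of a teacher-corrected reasoning trace (chosen) and the student's original erroneous trace (rejected). The function and variable names here are illustrative assumptions, not code from the SuperCorrect repository, and the paper's actual cross-model collaborative formulation may differ in detail.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: prefer the teacher-corrected trace (chosen)
    over the student's erroneous trace (rejected).

    All inputs are summed log-probabilities of the full response under the
    trainable student policy and a frozen reference model.
    """
    # Log-ratios of the policy vs. the reference model for each response
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between the corrected and the erroneous trace
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Dummy log-probabilities for a batch of two preference pairs
policy_chosen = torch.tensor([-12.3, -10.1])
policy_rejected = torch.tensor([-14.0, -11.5])
ref_chosen = torch.tensor([-13.0, -10.8])
ref_rejected = torch.tensor([-13.5, -11.0])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

In this kind of setup, the frozen reference model is typically the student checkpoint obtained after the first (template-distillation) stage, and beta controls how far the policy is allowed to drift from it while learning to prefer the teacher's corrections.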
