

SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights

October 11, 2024
Authors: Ling Yang, Zhaochen Yu, Tianjun Zhang, Minkai Xu, Joseph E. Gonzalez, Bin Cui, Shuicheng Yan
cs.AI

Abstract

Large language models (LLMs) like GPT-4, PaLM, and LLaMA have shown significant improvements in various reasoning tasks. However, smaller models such as Llama-3-8B and DeepSeekMath-Base still struggle with complex mathematical reasoning because they fail to effectively identify and correct reasoning errors. Recent reflection-based methods aim to address these issues by enabling self-reflection and self-correction, but they still face challenges in independently detecting errors in their reasoning steps. To overcome these limitations, we propose SuperCorrect, a novel two-stage framework that uses a large teacher model to supervise and correct both the reasoning and reflection processes of a smaller student model. In the first stage, we extract hierarchical high-level and detailed thought templates from the teacher model to guide the student model in eliciting more fine-grained reasoning thoughts. In the second stage, we introduce cross-model collaborative direct preference optimization (DPO) to enhance the self-correction abilities of the student model by following the teacher's correction traces during training. This cross-model DPO approach teaches the student model to effectively locate and resolve erroneous thoughts with error-driven insights from the teacher model, breaking the bottleneck of its thoughts and acquiring new skills and knowledge to tackle challenging problems. Extensive experiments consistently demonstrate our superiority over previous methods. Notably, our SuperCorrect-7B model significantly surpasses powerful DeepSeekMath-7B by 7.8%/5.3% and Qwen2.5-Math-7B by 15.1%/6.3% on MATH/GSM8K benchmarks, achieving new SOTA performance among all 7B models. Code: https://github.com/YangLing0818/SuperCorrect-llm
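To make the second stage concrete, the sketch below shows one way the cross-model collaborative DPO objective could look, assuming it reduces to the standard DPO loss with the teacher's corrected reasoning trace as the preferred response and the student's own erroneous trace as the rejected one. The function name, argument names, and the beta value are illustrative assumptions, not taken from the SuperCorrect code.

```python
# Minimal sketch of a DPO-style preference loss with teacher-corrected
# traces as the "chosen" responses (an assumption about how the paper's
# cross-model collaborative DPO is set up, not its actual implementation).
import torch
import torch.nn.functional as F


def cross_model_dpo_loss(
    policy_logp_corrected: torch.Tensor,   # student log-prob of the teacher-corrected trace
    policy_logp_erroneous: torch.Tensor,   # student log-prob of its own erroneous trace
    ref_logp_corrected: torch.Tensor,      # same quantities under the frozen reference model
    ref_logp_erroneous: torch.Tensor,
    beta: float = 0.1,                     # DPO temperature (illustrative value)
) -> torch.Tensor:
    """Standard DPO objective: widen the margin between corrected and erroneous traces."""
    chosen_reward = beta * (policy_logp_corrected - ref_logp_corrected)
    rejected_reward = beta * (policy_logp_erroneous - ref_logp_erroneous)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()


if __name__ == "__main__":
    # Toy batch of 4 sequence-level log-probabilities.
    batch = [torch.randn(4) for _ in range(4)]
    print(cross_model_dpo_loss(*batch).item())
```

In practice, the per-token log-probabilities of each trace would be summed (or averaged) before being passed in, and the reference model is typically a frozen copy of the student taken before preference training.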