SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights

Abstract

Large language models (LLMs) such as GPT-4, PaLM, and LLaMA have shown significant improvements on a variety of reasoning tasks. However, smaller models such as Llama-3-8B and DeepSeekMath-Base still struggle with complex mathematical reasoning because they fail to effectively identify and correct their reasoning errors. Recent reflection-based methods try to address this by enabling self-reflection and self-correction, but they still have difficulty detecting errors in their own reasoning steps independently. To overcome these limitations, we propose SuperCorrect, a novel two-stage framework that uses a large teacher model to supervise and correct both the reasoning and the reflection processes of a smaller student model. In the first stage, we extract hierarchical high-level and detailed thought templates from the teacher model to guide the student model in eliciting more fine-grained reasoning thoughts. In the second stage, we introduce cross-model collaborative direct preference optimization (DPO) to enhance the student model's self-correction ability by following the teacher's correction traces during training. This cross-model DPO approach teaches the student model to locate and resolve erroneous thoughts using error-driven insights from the teacher model, breaking the bottleneck of its own thoughts and acquiring new skills and knowledge for tackling challenging problems. Extensive experiments consistently demonstrate the superiority of SuperCorrect over previous methods. Notably, our SuperCorrect-7B model surpasses the powerful DeepSeekMath-7B by 7.8%/5.3% and Qwen2.5-Math-7B by 15.1%/6.3% on the MATH/GSM8K benchmarks, achieving new SOTA performance among all 7B models. Code: https://github.com/YangLing0818/SuperCorrect-llm
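
The second stage builds on direct preference optimization. As a minimal sketch of the standard DPO objective only (not the cross-model collaborative variant, whose details the abstract does not give), the loss for a single preference pair could be written as below. Treating the teacher's correction trace as the preferred response and the student's own faulty correction as the rejected one is an assumption made for illustration; the function name, argument names, and the default `beta` are hypothetical.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # hypothetical default; not taken from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed log-probabilities of each full response under the
    trainable policy (student) and under a frozen reference model.
    Assumption for illustration: "chosen" is the teacher's correction
    trace and "rejected" is the student's own erroneous correction.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) equals softplus(-logits).
    return softplus(-logits)


if __name__ == "__main__":
    # Policy identical to the reference: the loss equals log(2).
    print(dpo_loss(-3.0, -3.0, -3.0, -3.0))
    # Policy has shifted probability toward the chosen trace: lower loss.
    print(dpo_loss(-1.0, -5.0, -3.0, -3.0))
```

The loss falls as the student assigns relatively more probability to the preferred trace than the reference model does, which is the mechanism a preference pair built from teacher corrections would rely on.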
