Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning

Abstract

Recent advances in model distillation demonstrate that data from advanced reasoning models (e.g., DeepSeek-R1, OpenAI's o1) can effectively transfer complex reasoning abilities to smaller, more efficient student models. However, standard practice employs rejection sampling, which discards incorrect reasoning examples even though they are valuable and often underutilized data. This paper addresses a critical question: how can both positive and negative distilled reasoning traces be leveraged to maximize LLM reasoning performance in an offline setting? To this end, we propose Reinforcement Distillation (REDI), a two-stage framework. Stage 1 learns from positive traces via Supervised Fine-Tuning (SFT). Stage 2 further refines the model using both positive and negative traces through our proposed REDI objective. This novel objective is a simple, reference-free loss function that outperforms established methods such as DPO and SimPO in this distillation context. Our empirical evaluations on mathematical reasoning tasks show that REDI outperforms both the Rejection Sampling SFT baseline and SFT combined with DPO/SimPO. Notably, the Qwen-REDI-1.5B model, post-trained on just 131k positive and negative examples from the openly available Open-R1 dataset, achieves 83.1% on MATH-500 (pass@1). Its performance matches or surpasses that of DeepSeek-R1-Distill-Qwen-1.5B (a model post-trained on 800k proprietary examples) across various mathematical reasoning benchmarks, establishing a new state of the art for 1.5B models post-trained offline with openly available data.
