LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?

Abstract

Large language models (LLMs) have demonstrated remarkable progress in reasoning, often through supervised fine-tuning (SFT). However, SFT is resource-intensive, relying on large curated datasets, rejection-sampled demonstrations, and uniform optimization across all tokens, even though only a fraction of those tokens carry meaningful learning value. In this work, we explore a counterintuitive idea: can smaller language models (SLMs) teach larger language models (LLMs) by revealing high-value reasoning moments that reflect the latter's unique strength? We propose LightReasoner, a novel framework that leverages the behavioral divergence between a stronger expert model (LLM) and a weaker amateur model (SLM). LightReasoner operates in two stages: (1) a sampling stage that pinpoints critical reasoning moments and constructs supervision examples capturing the expert's advantage through expert-amateur contrast, and (2) a fine-tuning stage that aligns the expert model with these distilled examples, amplifying its reasoning strengths. Across seven mathematical benchmarks, LightReasoner improves accuracy by up to 28.1%, while reducing time consumption by 90%, sampled problems by 80%, and tuned token usage by 99%, all without relying on ground-truth labels. By turning weaker SLMs into effective teaching signals, LightReasoner offers a scalable and resource-efficient approach for advancing LLM reasoning. Code is available at: https://github.com/HKUDS/LightReasoner
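To make the sampling stage more concrete, the following is a minimal sketch of how "critical reasoning moments" might be located by contrasting expert and amateur next-token distributions. The abstract does not state which divergence measure or selection rule is used, so the use of KL divergence, the `threshold` value, and the function names `kl_divergence` and `select_critical_steps` are all assumptions made for illustration. The toy distributions are invented and are not results from the paper.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary.

    Assumption: KL divergence stands in for the "behavioral divergence"
    mentioned in the abstract; the source does not name the measure.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def select_critical_steps(expert_dists, amateur_dists, threshold=0.4):
    """Return indices of steps where expert and amateur disagree strongly.

    `threshold` is a hypothetical hyperparameter chosen for this toy example.
    """
    selected = []
    for step, (p_exp, p_ama) in enumerate(zip(expert_dists, amateur_dists)):
        if kl_divergence(p_exp, p_ama) > threshold:
            selected.append(step)
    return selected


# Toy next-token distributions over a 3-token vocabulary at 3 generation steps.
expert = [[0.7, 0.2, 0.1], [0.9, 0.05, 0.05], [0.34, 0.33, 0.33]]
amateur = [[0.6, 0.3, 0.1], [0.1, 0.8, 0.1], [0.33, 0.34, 0.33]]

critical = select_critical_steps(expert, amateur)
print(critical)  # only step 1 exceeds the threshold
```

In this sketch only the steps where the two models diverge sharply would be kept as supervision examples, which is one way the reported reduction in tuned tokens could arise. No ground-truth labels appear anywhere in the selection.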
