Warm Up Before You Train: Unlocking General Reasoning in Resource-Constrained Settings

Abstract

Designing effective reasoning-capable LLMs typically requires training with Reinforcement Learning with Verifiable Rewards (RLVR) or distillation from carefully curated Long Chains of Thought (CoT). Both depend heavily on extensive training data, which is a major challenge when high-quality training data is scarce. We propose a sample-efficient, two-stage training strategy for developing reasoning LLMs under limited supervision. In the first stage, we "warm up" the model by distilling Long CoTs from a toy domain, namely Knights & Knaves (K&K) logic puzzles, so that it acquires general reasoning skills. In the second stage, we apply RLVR to the warmed-up model using a limited set of target-domain examples. Our experiments demonstrate that this two-stage approach offers several benefits: (i) the warmup stage alone facilitates generalized reasoning, leading to performance improvements across a range of tasks, including MATH, HumanEval⁺, and MMLU-Pro; (ii) when both the base model and the warmed-up model are RLVR-trained on the same small dataset (≤100 examples), the warmed-up model consistently outperforms the base model; (iii) warming up before RLVR training allows a model to maintain cross-domain generalizability even after training on a specific domain; (iv) introducing warmup into the pipeline improves not only accuracy but also overall sample efficiency during RLVR training. The results in this paper highlight the promise of warmup for building robust reasoning LLMs in data-scarce environments.
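To make the toy domain concrete, the following is a minimal sketch of a Knights & Knaves puzzle checker in Python. It is not the paper's code: the paper distills Long CoTs from K&K puzzles rather than running a solver. The function names `solve_kk` and `reward`, the puzzle encoding (one statement per speaker, expressed as a Python function), and the example puzzle are all assumptions made for illustration. The only rule relied on is the standard one: knights always tell the truth and knaves always lie. The `reward` function shows what a binary, automatically checkable reward could look like for this kind of task.

```python
from itertools import product


def solve_kk(names, statements):
    """Return every knight/knave assignment consistent with all statements.

    `statements` maps a speaker to a function that takes an assignment
    (name -> True if knight, False if knave) and returns whether the
    speaker's claim is true in that assignment.
    """
    solutions = []
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        # A knight's claim must be true; a knave's claim must be false.
        if all(world[s] == claim(world) for s, claim in statements.items()):
            solutions.append(world)
    return solutions


def reward(predicted, names, statements):
    """Binary check: 1.0 if `predicted` is a consistent assignment, else 0.0."""
    return 1.0 if predicted in solve_kk(names, statements) else 0.0


# Hypothetical puzzle: A says "B is a knave"; B says "A and I are both knights".
names = ["A", "B"]
statements = {
    "A": lambda w: not w["B"],
    "B": lambda w: w["A"] and w["B"],
}
solutions = solve_kk(names, statements)
```

For this puzzle the brute-force search leaves a single consistent assignment, with A a knight and B a knave. Exhaustive enumeration is only practical for small puzzles, since the number of assignments doubles with each added character.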
