Warm Up Before You Train: Unlocking General Reasoning in Resource-Constrained Settings

Abstract

Designing effective reasoning-capable LLMs typically requires training with Reinforcement Learning with Verifiable Rewards (RLVR) or distillation from carefully curated long chains of thought (CoT), both of which depend heavily on extensive training data. This poses a major challenge when quality training data is scarce. We propose a sample-efficient, two-stage training strategy for developing reasoning LLMs under limited supervision. In the first stage, we "warm up" the model by distilling long CoTs from a toy domain, namely Knights & Knaves (K&K) logic puzzles, so that it acquires general reasoning skills. In the second stage, we apply RLVR to the warmed-up model using a limited set of target-domain examples. Our experiments demonstrate that this two-stage approach offers several benefits: (i) the warmup phase alone facilitates generalized reasoning, leading to performance improvements across a range of tasks, including MATH, HumanEval^{+}, and MMLU-Pro; (ii) when both the base model and the warmed-up model are RLVR-trained on the same small dataset (at most 100 examples), the warmed-up model consistently outperforms the base model; (iii) warming up before RLVR training allows a model to maintain cross-domain generalizability even after training on a specific domain; (iv) introducing warmup into the pipeline improves not only accuracy but also overall sample efficiency during RLVR training. The results in this paper highlight the promise of warmup for building robust reasoning LLMs in data-scarce environments.
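
To make the notion of a mechanically checkable answer concrete, the following is a minimal sketch of a brute-force checker for a Knights & Knaves puzzle, together with a binary reward built on top of it. It is an illustration only and not the authors' implementation: the abstract does not specify a puzzle encoding or a reward function, so the names `Statement`, `solve`, `reward`, and `puzzle` are hypothetical, and the example puzzle is invented for this sketch.

```python
from itertools import product
from typing import Callable, Dict, List

# Hypothetical encoding (an assumption, not taken from the paper):
# a "world" maps each speaker's name to True (knight) or False (knave),
# and each speaker's statement is a predicate over such a world.
Statement = Callable[[Dict[str, bool]], bool]


def solve(statements: Dict[str, Statement]) -> List[Dict[str, bool]]:
    """Enumerate all worlds consistent with the K&K rules:
    a knight's statement must be true, a knave's must be false."""
    names = sorted(statements)
    solutions = []
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        # The statement's truth value must equal the speaker's knight status.
        if all(statements[name](world) == world[name] for name in names):
            solutions.append(world)
    return solutions


def reward(predicted: Dict[str, bool], statements: Dict[str, Statement]) -> float:
    """Binary reward: 1.0 only if the puzzle has a unique solution
    and the predicted assignment matches it exactly."""
    solutions = solve(statements)
    return 1.0 if len(solutions) == 1 and predicted == solutions[0] else 0.0


# Invented example: A says "B is a knave"; B says "A and I are both knights".
puzzle: Dict[str, Statement] = {
    "A": lambda w: not w["B"],
    "B": lambda w: w["A"] and w["B"],
}

if __name__ == "__main__":
    print(solve(puzzle))
    print(reward({"A": True, "B": False}, puzzle))
```

Exhaustive enumeration is exponential in the number of speakers, which is acceptable for small toy puzzles; the point of the sketch is only that correctness can be decided by a program rather than by a human grader.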
