Small Models Struggle to Learn from Strong Reasoners

Abstract

Large language models (LLMs) excel in complex reasoning tasks, and distilling their reasoning capabilities into smaller models has shown promise. However, we uncover an interesting phenomenon, which we term the Small Model Learnability Gap: small models (≤3B parameters) do not consistently benefit from long chain-of-thought (CoT) reasoning or distillation from larger models. Instead, they perform better when fine-tuned on shorter, simpler reasoning chains that better align with their intrinsic learning capacity. To address this, we propose Mix Distillation, a simple yet effective strategy that balances reasoning complexity by combining long and short CoT examples or reasoning from both larger and smaller models. Our experiments demonstrate that Mix Distillation significantly improves small model reasoning performance compared to training on either type of data alone. These findings highlight the limitations of direct strong model distillation and underscore the importance of adapting reasoning complexity for effective reasoning capability transfer.
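
The data-mixing idea behind Mix Distillation can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mix_distillation_data`, the `long_fraction` parameter, and the example record layout are hypothetical, and the abstract does not state which mixing ratio was used, so the ratio is left as a free parameter.

```python
import random


def mix_distillation_data(long_cot, short_cot, long_fraction, size, seed=0):
    """Build a fine-tuning set that blends long-CoT and short-CoT examples.

    `long_fraction` is a hypothetical knob (not taken from the abstract):
    the share of the final set drawn from the long-CoT pool.
    """
    if not 0.0 <= long_fraction <= 1.0:
        raise ValueError("long_fraction must be in [0, 1]")

    rng = random.Random(seed)
    n_long = round(size * long_fraction)
    n_short = size - n_long
    if n_long > len(long_cot) or n_short > len(short_cot):
        raise ValueError("not enough examples in one of the pools")

    # Sample without replacement from each pool, then shuffle so that
    # long and short examples are interleaved during training.
    mixed = rng.sample(long_cot, n_long) + rng.sample(short_cot, n_short)
    rng.shuffle(mixed)
    return mixed


# Toy pools standing in for teacher-generated reasoning traces.
long_pool = [{"id": i, "style": "long"} for i in range(10)]
short_pool = [{"id": i, "style": "short"} for i in range(10)]
mixed = mix_distillation_data(long_pool, short_pool, long_fraction=0.25, size=8)
```

The same helper would apply to the second variant described in the abstract (mixing reasoning from larger and smaller teacher models) by passing the two teachers' outputs as the two pools.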
