When Does Reasoning Matter? A Controlled Study of Reasoning's Contribution to Model Performance

Abstract

Large Language Models (LLMs) with reasoning capabilities have achieved state-of-the-art performance on a wide range of tasks. Despite this empirical success, the tasks and model scales at which reasoning becomes effective, as well as its training and inference costs, remain underexplored. In this work, we use a synthetic data distillation framework to conduct a large-scale supervised study. We compare Instruction Fine-Tuning (IFT) models and reasoning models of varying sizes on a wide range of math-centric and general-purpose tasks, in both multiple-choice and open-ended formats. Our analysis shows that reasoning consistently improves model performance, often matching or surpassing significantly larger IFT systems. Notably, while IFT remains Pareto-optimal in training and inference costs, reasoning models become increasingly valuable as model size grows, overcoming the performance limits of IFT on reasoning-intensive and open-ended tasks.
