
When Does Reasoning Matter? A Controlled Study of Reasoning's Contribution to Model Performance

September 26, 2025
Authors: Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Kevin El-Haddad, Céline Hudelot, Pierre Colombo
cs.AI

Abstract

Large Language Models (LLMs) with reasoning capabilities have achieved state-of-the-art performance on a wide range of tasks. Despite this empirical success, the tasks and model scales at which reasoning becomes effective, as well as its training and inference costs, remain underexplored. In this work, we rely on a synthetic data distillation framework to conduct a large-scale supervised study. We compare Instruction Fine-Tuning (IFT) and reasoning models of varying sizes on a wide range of math-centric and general-purpose tasks, evaluating both multiple-choice and open-ended formats. Our analysis reveals that reasoning consistently improves model performance, often matching or surpassing significantly larger IFT systems. Notably, while IFT remains Pareto-optimal in training and inference costs, reasoning models become increasingly valuable as model size scales, overcoming IFT performance limits on reasoning-intensive and open-ended tasks.
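The abstract's claim that IFT "remains Pareto-optimal in training and inference costs" refers to a standard cost/performance trade-off analysis: a model configuration is Pareto-optimal if no other configuration is both cheaper and at least as accurate. The sketch below illustrates that selection criterion; the function name and all (cost, accuracy) values are hypothetical and not taken from the paper.

```python
def pareto_frontier(points):
    """Return the (cost, accuracy) points not dominated by any other.

    A point is dominated if some other point has cost <= its cost and
    accuracy >= its accuracy, with at least one comparison strict.
    """
    frontier = []
    for c, a in points:
        dominated = any(
            (c2 <= c and a2 >= a) and (c2 < c or a2 > a)
            for c2, a2 in points
        )
        if not dominated:
            frontier.append((c, a))
    return sorted(frontier)

# Hypothetical (cost, accuracy) pairs for several model configurations.
configs = [(1.0, 0.55), (2.0, 0.60), (1.5, 0.50), (4.0, 0.72), (3.0, 0.58)]
print(pareto_frontier(configs))  # → [(1.0, 0.55), (2.0, 0.60), (4.0, 0.72)]
```

Here (1.5, 0.50) and (3.0, 0.58) are dropped because a cheaper, more accurate configuration exists for each; the remaining points form the cost/performance frontier the paper's comparison is framed around.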