When Does Reasoning Matter? A Controlled Study of Reasoning's Contribution to Model Performance
September 26, 2025
Authors: Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Kevin El-Haddad, Céline Hudelot, Pierre Colombo
cs.AI
Abstract
Large Language Models (LLMs) with reasoning capabilities have achieved state-of-the-art performance on a wide range of tasks. Despite this empirical success, the tasks and model scales at which reasoning becomes effective, as well as its training and inference costs, remain underexplored. In this work, we rely on a synthetic data distillation framework to conduct a large-scale supervised study. We compare Instruction Fine-Tuning (IFT) and reasoning models of varying sizes on a wide range of math-centric and general-purpose tasks, evaluating both multiple-choice and open-ended formats. Our analysis reveals that reasoning consistently improves model performance, often matching or surpassing significantly larger IFT systems. Notably, while IFT remains Pareto-optimal in training and inference costs, reasoning models become increasingly valuable as model size scales, overcoming IFT performance limits on reasoning-intensive and open-ended tasks.
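
The abstract does not detail the synthetic data distillation framework. As a minimal illustrative sketch only, not the authors' pipeline, the Python below shows one common way such a comparison is set up: for each prompt, a teacher model produces a bare answer and a reasoning trace, yielding two parallel supervised fine-tuning corpora, an IFT corpus (answer-only targets) and a reasoning corpus (trace-plus-answer targets). All names here (Example, teacher_generate, build_corpora) and the <think> trace format are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class Example:
        prompt: str
        target: str  # supervised fine-tuning target string

    def teacher_generate(prompt: str) -> tuple[str, str]:
        """Hypothetical teacher call returning (reasoning_trace, final_answer).

        In a real distillation setup this would sample a large teacher LLM
        with and without chain-of-thought; here it is stubbed out.
        """
        trace = f"<think>step-by-step reasoning for: {prompt}</think>"
        answer = f"final answer for: {prompt}"
        return trace, answer

    def build_corpora(prompts: list[str]) -> tuple[list[Example], list[Example]]:
        """Build two parallel SFT corpora from the same prompts:
        an IFT corpus (answer only) and a reasoning corpus (trace + answer)."""
        ift, reasoning = [], []
        for p in prompts:
            trace, answer = teacher_generate(p)
            ift.append(Example(p, answer))
            reasoning.append(Example(p, f"{trace}\n{answer}"))
        return ift, reasoning

    if __name__ == "__main__":
        ift_data, reasoning_data = build_corpora(["What is 17 * 24?"])
        print(ift_data[0].target)        # answer-only target (IFT)
        print(reasoning_data[0].target)  # trace + answer target (reasoning)

Under this setup, students of varying sizes fine-tuned on the two corpora can be evaluated on the same multiple-choice and open-ended benchmarks, mirroring the IFT-versus-reasoning comparison the abstract describes.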