When Does Reasoning Matter? A Controlled Study of Reasoning's Contribution to Model Performance

Abstract

Large Language Models (LLMs) with reasoning capabilities have achieved state-of-the-art performance on a wide range of tasks. Despite their empirical success, the tasks and model scales at which reasoning becomes effective, as well as its training and inference costs, remain underexplored. In this work, we use a synthetic data distillation framework to conduct a large-scale supervised study. We compare Instruction Fine-Tuning (IFT) models and reasoning models of varying sizes on a wide range of math-centric and general-purpose tasks, evaluating both multiple-choice and open-ended formats. Our analysis reveals that reasoning consistently improves model performance, often matching or surpassing significantly larger IFT systems. Notably, while IFT remains Pareto-optimal in training and inference costs, reasoning models become increasingly valuable as model size scales, overcoming IFT performance limits on reasoning-intensive and open-ended tasks.
