Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead

Abstract

Inference-time scaling can enhance the reasoning capabilities of large language models (LLMs) on complex problems that benefit from step-by-step problem solving. Although lengthening generated scratchpads has proven effective for mathematical tasks, the broader impact of this approach on other tasks remains less clear. In this work, we investigate the benefits and limitations of scaling methods across nine state-of-the-art models and eight challenging tasks, including math and STEM reasoning, calendar planning, NP-hard problems, navigation, and spatial reasoning. We compare conventional models (e.g., GPT-4o) with models fine-tuned for inference-time scaling (e.g., o1) through evaluation protocols that involve repeated model calls, either independently or sequentially with feedback. These evaluations approximate each model's lower and upper performance bounds and its potential for future improvement, whether through enhanced training or multi-model inference systems. Our extensive empirical analysis reveals that the advantages of inference-time scaling vary across tasks and diminish as problem complexity increases. In addition, simply using more tokens does not necessarily translate to higher accuracy in these challenging regimes. Results from multiple independent runs with conventional models using perfect verifiers show that, for some tasks, these models can achieve performance close to the average performance of today's most advanced reasoning models. However, for other tasks, a significant performance gap remains, even in very high scaling regimes. Encouragingly, all models demonstrate significant gains when inference is further scaled with perfect verifiers or strong feedback, suggesting ample potential for future improvements.
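To make the idea of lower and upper performance bounds from repeated independent calls concrete, the following is a minimal sketch, not the paper's evaluation code. It assumes each problem is attempted n times independently and that a perfect verifier marks every attempt as correct or incorrect. The function name `scaling_bounds` and the three aggregate names are hypothetical choices for this illustration; the abstract does not specify the exact aggregation.

```python
from statistics import mean


def scaling_bounds(outcomes):
    """Aggregate verifier results from repeated independent runs.

    outcomes: one list per problem, holding one boolean per run
    (True if a perfect verifier accepted that run's answer).
    All three aggregates here are illustrative assumptions.
    """
    if not outcomes or any(len(runs) == 0 for runs in outcomes):
        raise ValueError("every problem needs at least one run")

    # Lower bound: a problem counts only if every run solved it.
    worst_of_n = mean(1.0 if all(runs) else 0.0 for runs in outcomes)
    # Expected single-call accuracy: per-problem success rate, averaged.
    average = mean(sum(runs) / len(runs) for runs in outcomes)
    # Upper bound: a problem counts if any run solved it, which is what
    # a perfect verifier could select from the n candidates.
    best_of_n = mean(1.0 if any(runs) else 0.0 for runs in outcomes)

    return {"worst_of_n": worst_of_n, "average": average, "best_of_n": best_of_n}


# Toy data: 4 problems, 3 independent runs each (made up for illustration).
example = [
    [True, True, True],
    [True, False, False],
    [False, False, False],
    [False, True, True],
]
bounds = scaling_bounds(example)
print(bounds)
```

Under these assumptions, the gap between `average` and `best_of_n` indicates how much a perfect verifier could add on top of a single call, while `worst_of_n` reflects how reliably the model solves a problem across runs.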

