Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead

Abstract

Inference-time scaling can enhance the reasoning capabilities of large language models (LLMs) on complex problems that benefit from step-by-step problem solving. Although lengthening generated scratchpads has proven effective for mathematical tasks, the broader impact of this approach on other tasks remains less clear. In this work, we investigate the benefits and limitations of scaling methods across nine state-of-the-art models and eight challenging tasks, including math and STEM reasoning, calendar planning, NP-hard problems, navigation, and spatial reasoning. We compare conventional models (e.g., GPT-4o) with models fine-tuned for inference-time scaling (e.g., o1) through evaluation protocols that involve repeated model calls, either independently or sequentially with feedback. For each model, these evaluations approximate lower and upper performance bounds and the potential for future performance improvements, whether through enhanced training or multi-model inference systems. Our extensive empirical analysis reveals that the advantages of inference-time scaling vary across tasks and diminish as problem complexity increases. In addition, simply using more tokens does not necessarily translate to higher accuracy in these challenging regimes. Results from multiple independent runs with conventional models using perfect verifiers show that, for some tasks, these models can achieve performance close to the average performance of today's most advanced reasoning models. However, for other tasks, a significant performance gap remains, even in very high scaling regimes. Encouragingly, all models demonstrate significant gains when inference is further scaled with perfect verifiers or strong feedback, suggesting ample potential for future improvements.
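As an illustration of how repeated independent calls could yield approximate lower and upper performance bounds, the following is a minimal sketch and not the authors' evaluation code. It assumes that a perfect verifier has already labeled each run as correct or incorrect. The function name `aggregate_runs` and the metric names `best_of_n`, `worst_of_n`, and `average` are hypothetical, introduced here only for illustration.

```python
def aggregate_runs(correct):
    """Summarize repeated independent runs per problem.

    correct[i][j] is True if run j on problem i was judged correct
    by an (assumed) perfect verifier. All names here are hypothetical.
    """
    n_problems = len(correct)
    # Best-of-n: a problem counts as solved if any run is correct
    # (an approximate upper bound under a perfect verifier).
    best = sum(any(runs) for runs in correct) / n_problems
    # Worst-of-n: a problem counts as solved only if every run is correct
    # (an approximate lower bound).
    worst = sum(all(runs) for runs in correct) / n_problems
    # Average: expected accuracy of a single run.
    avg = sum(sum(runs) / len(runs) for runs in correct) / n_problems
    return {"best_of_n": best, "average": avg, "worst_of_n": worst}


# Toy data: four problems, two independent runs each.
results = aggregate_runs([
    [True, False],
    [False, False],
    [True, True],
    [True, True],
])
print(results)
```

In this toy example, the gap between the best-of-n and average values indicates how much a verifier could recover from repeated sampling alone. The sequential protocol with feedback described in the abstract is not modeled here.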
