

Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead

March 31, 2025
作者: Vidhisha Balachandran, Jingya Chen, Lingjiao Chen, Shivam Garg, Neel Joshi, Yash Lara, John Langford, Besmira Nushi, Vibhav Vineet, Yue Wu, Safoora Yousefi
cs.AI

Abstract

Inference-time scaling can enhance the reasoning capabilities of large language models (LLMs) on complex problems that benefit from step-by-step problem solving. Although lengthening generated scratchpads has proven effective for mathematical tasks, the broader impact of this approach on other tasks remains less clear. In this work, we investigate the benefits and limitations of scaling methods across nine state-of-the-art models and eight challenging tasks, including math and STEM reasoning, calendar planning, NP-hard problems, navigation, and spatial reasoning. We compare conventional models (e.g., GPT-4o) with models fine-tuned for inference-time scaling (e.g., o1) through evaluation protocols that involve repeated model calls, either independently or sequentially with feedback. These evaluations approximate lower and upper performance bounds and potential for future performance improvements for each model, whether through enhanced training or multi-model inference systems. Our extensive empirical analysis reveals that the advantages of inference-time scaling vary across tasks and diminish as problem complexity increases. In addition, simply using more tokens does not necessarily translate to higher accuracy in these challenging regimes. Results from multiple independent runs with conventional models using perfect verifiers show that, for some tasks, these models can achieve performance close to the average performance of today's most advanced reasoning models. However, for other tasks, a significant performance gap remains, even in very high scaling regimes. Encouragingly, all models demonstrate significant gains when inference is further scaled with perfect verifiers or strong feedback, suggesting ample potential for future improvements.
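The two evaluation protocols described in the abstract — repeated independent model calls checked by a perfect verifier, and sequential calls conditioned on feedback — can be sketched as follows. This is a minimal illustration, not the paper's actual harness: `generate`, `verify`, and `critique` are hypothetical stand-ins for a model call, an oracle answer checker, and a feedback generator.

```python
def best_of_n(generate, verify, n):
    """Parallel scaling: draw n independent samples and let a perfect
    verifier accept any correct one (approximates an upper bound on
    what the model could achieve with ideal answer selection)."""
    answers = [generate() for _ in range(n)]
    return next((a for a in answers if verify(a)), None)


def sequential_with_feedback(generate, verify, critique, max_rounds):
    """Sequential scaling: each retry is conditioned on feedback about
    the previous failed attempt, until the verifier accepts or the
    call budget runs out."""
    feedback = None  # first attempt gets no feedback
    for _ in range(max_rounds):
        answer = generate(feedback)
        if verify(answer):
            return answer
        feedback = critique(answer)
    return None
```

Under either protocol, the number of model calls (`n` or `max_rounds`) is the inference-time scaling knob the paper varies.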

