

The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs

September 11, 2025
Authors: Akshit Sinha, Arvindh Arun, Shashwat Goel, Steffen Staab, Jonas Geiping
cs.AI

Abstract

Does continued scaling of large language models (LLMs) yield diminishing returns? Real-world value often stems from the length of task an agent can complete. We start this work by observing the simple but counterintuitive fact that marginal gains in single-step accuracy can compound into exponential improvements in the length of a task a model can successfully complete. Then, we argue that failures of LLMs on simple tasks that are made longer arise from mistakes in execution, rather than an inability to reason. We propose isolating execution capability by explicitly providing the knowledge and plan needed to solve a long-horizon task. We find that larger models can correctly execute significantly more turns even when small models have 100% single-turn accuracy. We observe that the per-step accuracy of models degrades as the number of steps increases. This is not just due to long-context limitations: curiously, we observe a self-conditioning effect, where models become more likely to make mistakes when the context contains their errors from prior turns. Self-conditioning is not reduced by merely scaling up model size. In contrast, recent thinking models do not self-condition, and can also execute much longer tasks in a single turn. We conclude by benchmarking frontier thinking models on the length of task they can execute in a single turn. Overall, by focusing on the ability to execute, we hope to reconcile debates on how LLMs can solve complex reasoning problems yet fail at simple tasks when made longer, and to highlight the massive benefits of scaling model size and sequential test-time compute for long-horizon tasks.
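
To make the compounding claim concrete, here is a minimal sketch under the standard assumption of independent per-step errors; the horizon_length helper and the 50% success threshold are illustrative choices, not the paper's exact benchmark. If per-step accuracy is p, an n-step task succeeds with probability p^n, so the longest task completed at least half the time is ln(0.5)/ln(p), which grows roughly like 0.69/(1 - p) as p approaches 1.

    import math

    def horizon_length(step_accuracy: float, threshold: float = 0.5) -> float:
        """Longest task (in steps) completed with probability >= threshold,
        assuming independent per-step errors: P(n steps) = step_accuracy**n.
        (Illustrative helper, not the paper's benchmark code.)"""
        return math.log(threshold) / math.log(step_accuracy)

    # Small single-step gains compound into large horizon gains:
    for p in (0.90, 0.95, 0.99, 0.999):
        print(f"per-step accuracy {p:.3f} -> horizon ~ {horizon_length(p):6.1f} steps")

    # per-step accuracy 0.900 -> horizon ~    6.6 steps
    # per-step accuracy 0.950 -> horizon ~   13.5 steps
    # per-step accuracy 0.990 -> horizon ~   69.0 steps
    # per-step accuracy 0.999 -> horizon ~  692.8 steps

Note the jump from 0.99 to 0.999: a gain of less than one percentage point in single-step accuracy stretches the achievable horizon by an order of magnitude, which is why per-step improvements that look like diminishing returns can still translate into large gains on long-horizon tasks.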