

The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs

September 11, 2025
Authors: Akshit Sinha, Arvindh Arun, Shashwat Goel, Steffen Staab, Jonas Geiping
cs.AI

Abstract

Does continued scaling of large language models (LLMs) yield diminishing returns? Real-world value often stems from the length of task an agent can complete. We start this work by observing the simple but counterintuitive fact that marginal gains in single-step accuracy can compound into exponential improvements in the length of a task a model can successfully complete. Then, we argue that failures of LLMs when simple tasks are made longer arise from mistakes in execution, rather than an inability to reason. We propose isolating execution capability, by explicitly providing the knowledge and plan needed to solve a long-horizon task. We find that larger models can correctly execute significantly more turns even when small models have 100\% single-turn accuracy. We observe that the per-step accuracy of models degrades as the number of steps increases. This is not just due to long-context limitations -- curiously, we observe a self-conditioning effect -- models become more likely to make mistakes when the context contains their errors from prior turns. Self-conditioning does not reduce by just scaling the model size. In contrast, recent thinking models do not self-condition, and can also execute much longer tasks in a single turn. We conclude by benchmarking frontier thinking models on the length of task they can execute in a single turn. Overall, by focusing on the ability to execute, we hope to reconcile debates on how LLMs can solve complex reasoning problems yet fail at simple tasks when made longer, and highlight the massive benefits of scaling model size and sequential test-time compute for long-horizon tasks.
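The compounding claim in the abstract can be sketched numerically. Under the simplifying assumption that each step succeeds independently with per-step accuracy p, the probability of completing an H-step task is p**H, so the longest horizon achievable at a given success rate grows roughly like 1/(1-p). The `horizon` helper below is illustrative, not from the paper:

```python
import math

def horizon(p: float, success: float = 0.5) -> float:
    """Longest task length H (in steps) completable with probability
    `success`, assuming independent per-step accuracy p:
    p**H >= success  =>  H = ln(success) / ln(p)."""
    return math.log(success) / math.log(p)

# Marginal gains in single-step accuracy compound into large horizon gains:
for p in (0.90, 0.99, 0.999):
    print(f"p = {p}: H ≈ {horizon(p):.0f} steps")
```

Moving per-step accuracy from 90% to 99% (a ~10% relative gain) lengthens the achievable horizon roughly tenfold, which is the counterintuitive compounding the authors highlight.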