Scaling Test-Time Compute for Agentic Coding
April 16, 2026
Authors: Joongwon Kim, Wannan Yang, Kelvin Niu, Hongming Zhang, Yun Zhu, Eryk Helenowski, Ruan Silva, Zhengxing Chen, Srinivasan Iyer, Manzil Zaheer, Daniel Fried, Hannaneh Hajishirzi, Sanjeev Arora, Gabriel Synnaeve, Ruslan Salakhutdinov, Anirudh Goyal
cs.AI
Abstract
Test-time scaling has become a powerful way to improve large language models. However, existing methods are best suited to short, bounded outputs that can be directly compared, ranked, or refined. Long-horizon coding agents violate this premise: each attempt produces an extended trajectory of the agent's actions, observations, errors, and partial progress. In this setting, the main challenge is no longer generating more attempts, but representing prior experience in a form that can be effectively selected from and reused. We propose a test-time scaling framework for agentic coding based on compact representations of rollout trajectories. Our framework converts each rollout into a structured summary that preserves its salient hypotheses, progress, and failure modes while discarding low-signal trace details. This representation enables two complementary forms of inference-time scaling. For parallel scaling, we introduce Recursive Tournament Voting (RTV), which recursively narrows a population of rollout summaries through small-group comparisons. For sequential scaling, we adapt Parallel-Distill-Refine (PDR) to the agentic setting by conditioning new rollouts on summaries distilled from prior attempts. Our method consistently improves the performance of frontier coding agents on SWE-Bench Verified and Terminal-Bench v2.0: for example, Claude-4.5-Opus improves from 70.9% to 77.6% on SWE-Bench Verified (mini-SWE-agent) and from 46.9% to 59.1% on Terminal-Bench v2.0 (Terminus 1). Our results suggest that test-time scaling for long-horizon agents is fundamentally a problem of representation, selection, and reuse.
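The parallel-scaling procedure described above can be sketched as a recursive tournament: partition the population of rollout summaries into small groups, keep one winner per group, and repeat until a single summary remains. The sketch below is a minimal illustration of that control flow only; the `judge` function is a hypothetical placeholder for an LLM-based comparator, and the group size and toy judging rule are assumptions not specified in the abstract.

```python
import random


def recursive_tournament_voting(summaries, judge, group_size=4, seed=0):
    """Recursively narrow a population of rollout summaries to one winner
    via small-group comparisons.

    `judge` takes a small list of summaries and returns the preferred one;
    in the paper's setting this would be an LLM comparison, which we stand
    in for with a simple callable here (an assumption for illustration).
    """
    rng = random.Random(seed)
    pool = list(summaries)
    while len(pool) > 1:
        rng.shuffle(pool)
        # Partition the pool into groups of at most `group_size` and
        # advance one winner per group to the next round.
        pool = [
            judge(pool[i:i + group_size])
            for i in range(0, len(pool), group_size)
        ]
    return pool[0]


# Toy usage: a stand-in judge that prefers the longest summary string.
winner = recursive_tournament_voting(
    [f"summary-{i}" * (i + 1) for i in range(10)],
    judge=lambda group: max(group, key=len),
)
```

Because each round shrinks the pool by roughly a factor of `group_size`, the number of judge calls is linear in the population size, which is what makes the recursive narrowing cheap relative to all-pairs comparison.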