

Scaling Test-Time Compute for Agentic Coding

April 16, 2026
Authors: Joongwon Kim, Wannan Yang, Kelvin Niu, Hongming Zhang, Yun Zhu, Eryk Helenowski, Ruan Silva, Zhengxing Chen, Srinivasan Iyer, Manzil Zaheer, Daniel Fried, Hannaneh Hajishirzi, Sanjeev Arora, Gabriel Synnaeve, Ruslan Salakhutdinov, Anirudh Goyal
cs.AI

Abstract

Test-time scaling has become a powerful way to improve large language models. However, existing methods are best suited to short, bounded outputs that can be directly compared, ranked, or refined. Long-horizon coding agents violate this premise: each attempt produces an extended trajectory of agent actions, observations, errors, and partial progress. In this setting, the main challenge is no longer generating more attempts, but representing prior experience in a form that can be effectively selected from and reused. We propose a test-time scaling framework for agentic coding based on compact representations of rollout trajectories. Our framework converts each rollout into a structured summary that preserves its salient hypotheses, progress, and failure modes while discarding low-signal trace details. This representation enables two complementary forms of inference-time scaling. For parallel scaling, we introduce Recursive Tournament Voting (RTV), which recursively narrows a population of rollout summaries through small-group comparisons. For sequential scaling, we adapt Parallel-Distill-Refine (PDR) to the agentic setting by conditioning new rollouts on summaries distilled from prior attempts. Our method consistently improves the performance of frontier coding agents on SWE-Bench Verified and Terminal-Bench v2.0. For example, with our method, Claude-4.5-Opus improves from 70.9% to 77.6% on SWE-Bench Verified (mini-SWE-agent) and from 46.9% to 59.1% on Terminal-Bench v2.0 (Terminus 1). Our results suggest that test-time scaling for long-horizon agents is fundamentally a problem of representation, selection, and reuse.
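The parallel-scaling procedure described above, Recursive Tournament Voting, can be illustrated with a minimal sketch. This is not the paper's implementation; it only shows the control flow of recursively narrowing a pool of rollout summaries via small-group comparisons. The `judge` callable (which would be an LLM comparison call in practice) and the `group_size` parameter are assumptions for illustration.

```python
import random

def recursive_tournament_voting(summaries, judge, group_size=4, seed=0):
    """Illustrative sketch of RTV: repeatedly partition the pool of
    rollout summaries into small groups, keep one winner per group
    (chosen by `judge`, which returns the index of the best summary
    in a group), and recurse until a single summary remains."""
    rng = random.Random(seed)
    pool = list(summaries)
    while len(pool) > 1:
        rng.shuffle(pool)  # vary group composition across rounds
        next_round = []
        for i in range(0, len(pool), group_size):
            group = pool[i:i + group_size]
            if len(group) == 1:
                next_round.append(group[0])  # bye: advances unopposed
            else:
                next_round.append(group[judge(group)])
        pool = next_round
    return pool[0]
```

With `n` summaries and groups of size `g`, each round shrinks the pool by roughly a factor of `g`, so the total number of judge calls is linear in `n` while each individual comparison stays small enough to fit in context.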