Scaling Test-Time Compute for Agentic Coding

Abstract

Test-time scaling has become a powerful way to improve large language models. However, existing methods are best suited to short, bounded outputs that can be directly compared, ranked, or refined. Long-horizon coding agents violate this premise: each attempt produces an extended trajectory of the agent's actions, observations, errors, and partial progress. In this setting, the main challenge is no longer generating more attempts, but representing prior experience in a form that can be effectively selected from and reused. We propose a test-time scaling framework for agentic coding based on compact representations of rollout trajectories. Our framework converts each rollout into a structured summary that preserves its salient hypotheses, progress, and failure modes while discarding low-signal trace details. This representation enables two complementary forms of inference-time scaling. For parallel scaling, we introduce Recursive Tournament Voting (RTV), which recursively narrows a population of rollout summaries through small-group comparisons. For sequential scaling, we adapt Parallel-Distill-Refine (PDR) to the agentic setting by conditioning new rollouts on summaries distilled from prior attempts. Our method consistently improves the performance of frontier coding agents on SWE-Bench Verified and Terminal-Bench v2.0. For example, with our method, Claude-4.5-Opus improves from 70.9% to 77.6% on SWE-Bench Verified (mini-SWE-agent) and from 46.9% to 59.1% on Terminal-Bench v2.0 (Terminus 1). Our results suggest that test-time scaling for long-horizon agents is fundamentally a problem of representation, selection, and reuse.
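
As a rough illustration of the parallel-scaling idea, the sketch below narrows a population of rollout summaries through repeated small-group comparisons until one remains. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `recursive_tournament_vote`, the `judge` callback, the `group_size` parameter, and the toy judge are all hypothetical, and in practice the judge would presumably be a model comparing structured summaries rather than a length heuristic.

```python
from typing import Callable, List, Sequence

Summary = str
# Hypothetical judge: given a small group of summaries, return the index
# of the one it prefers. An LLM-based comparison is assumed in practice.
Judge = Callable[[Sequence[Summary]], int]


def recursive_tournament_vote(
    summaries: Sequence[Summary],
    judge: Judge,
    group_size: int = 4,
) -> Summary:
    """Repeatedly split the population into small groups, keep one winner
    per group, and recurse on the winners until a single summary is left."""
    if not summaries:
        raise ValueError("need at least one rollout summary")
    if group_size < 2:
        raise ValueError("group_size must be at least 2")

    population: List[Summary] = list(summaries)
    while len(population) > 1:
        winners: List[Summary] = []
        for start in range(0, len(population), group_size):
            group = population[start:start + group_size]
            winners.append(group[judge(group)])
        population = winners  # strictly smaller each round
    return population[0]


def toy_judge(group: Sequence[Summary]) -> int:
    """Stand-in judge for demonstration only: prefers the longest summary."""
    return max(range(len(group)), key=lambda i: len(group[i]))


if __name__ == "__main__":
    candidates = ["a", "bbb", "cc", "dddd", "e"]
    print(recursive_tournament_vote(candidates, toy_judge, group_size=2))
```

Because each round compares only `group_size` summaries at a time, no single comparison has to fit the whole population into one context, which is one plausible reason to prefer a tournament over a single global ranking.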
