To Execute or Not to Execute: A Cost-Benefit Analysis of Code Execution in LLM-Based Program Repair

Abstract

LLM-based agents for program repair are increasingly built on a "generate-run-revise" paradigm, in which the agent iteratively executes tests to evaluate and refine patches. This execution-based approach has become standard practice in state-of-the-art systems. However, execution can be slow and expensive, and its impact on these agents remains underexplored. In this paper, we conduct a two-stage empirical study of execution behavior in LLM-based program repair. First, to characterize execution behavior at scale, we analyze 7,745 agent traces from SWE-bench leaderboard submissions. Second, we evaluate 3,000 end-to-end repair attempts across 200 SWE-bench instances and three agents (Claude Code, Codex, and the open-source OpenCode) under four execution paradigms, which allows a fine-grained comparison of performance and cost. Our analysis yields three key observations: (1) Code execution is used by every agent and model analyzed, with an average of 8.8 test runs per task. Execution behavior varies substantially across agents and models, with frequency ranging from 2 to 19 executions per task, and late-stage executions consistently achieve higher success rates than early-stage ones. (2) Execution restrictions have little effect on repair success: on commercial agents with SOTA models, the resolve-rate gap between the Prohibited and Unrestricted paradigms is only 1.25 percentage points and is not statistically significant, while Prohibited substantially reduces token and wall-clock cost. (3) The benefit of execution is concentrated rather than uniform. These patterns suggest that current agents apply execution indiscriminately, paying its cost on instances where it provides little benefit. Execution should therefore be treated as a resource with an explicit cost-benefit tradeoff, not as a default capability.
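To make observation (2) concrete, the sketch below shows one way a resolve-rate gap in percentage points and its significance could be computed from paired per-instance outcomes. The abstract does not state which statistical test was used; the exact McNemar test here is an assumption chosen because both paradigms are run on the same instances. The function names and the toy outcome lists are hypothetical and are not data from the study.

```python
from math import comb


def resolve_rate_gap(unrestricted, prohibited):
    """Resolve-rate gap in percentage points (unrestricted minus prohibited)."""
    if len(unrestricted) != len(prohibited):
        raise ValueError("outcome lists must cover the same instances")
    n = len(unrestricted)
    return 100.0 * (sum(unrestricted) - sum(prohibited)) / n


def mcnemar_exact_p(unrestricted, prohibited):
    """Two-sided exact McNemar p-value for paired boolean outcomes.

    Only discordant pairs (resolved under exactly one paradigm) carry
    information; under the null hypothesis they split 50/50.
    """
    b = sum(1 for u, p in zip(unrestricted, prohibited) if u and not p)
    c = sum(1 for u, p in zip(unrestricted, prohibited) if p and not u)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Toy, made-up outcomes for 8 instances (True = instance resolved).
toy_unrestricted = [True, True, True, False, True, False, True, False]
toy_prohibited = [True, True, False, False, True, False, False, True]

gap = resolve_rate_gap(toy_unrestricted, toy_prohibited)
p_value = mcnemar_exact_p(toy_unrestricted, toy_prohibited)
print(f"gap = {gap:.2f} pp, exact McNemar p = {p_value:.3f}")
```

A paired test is used in this sketch because an unpaired comparison of two proportions would ignore that the same instances are attempted under both paradigms.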