CodeTracer: Towards Traceable Agent States

Abstract

Code agents are advancing rapidly, but debugging them is becoming increasingly difficult. As frameworks orchestrate parallel tool calls and multi-stage workflows over complex tasks, the agent's state transitions and error propagation become hard to observe. In these runs, an early misstep can trap the agent in unproductive loops or even cascade into fundamental errors, forming hidden error chains that make it hard to tell when the agent goes off track and why. Existing agent tracing analyses either focus on simple interactions or rely on small-scale manual inspection, which limits their scalability and usefulness for real coding workflows. We present CodeTracer, a tracing architecture that parses heterogeneous run artifacts through evolving extractors, reconstructs the full state transition history as a hierarchical trace tree with persistent memory, and performs failure onset localization to pinpoint the failure origin and its downstream chain. To enable systematic evaluation, we construct CodeTraceBench from a large collection of executed trajectories generated by four widely used code agent frameworks on diverse code tasks (e.g., bug fixing, refactoring, and terminal interaction), with supervision at both the stage and step levels for failure localization. Experiments show that CodeTracer substantially outperforms direct prompting and lightweight baselines, and that replaying its diagnostic signals consistently recovers originally failed runs under matched budgets. Our code and data are publicly available.
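
To make the idea of a hierarchical trace tree with stage-level and step-level failure localization more concrete, the following is a minimal illustrative sketch. It is not CodeTracer's implementation: the class names (`Step`, `Stage`, `TraceTree`), the boolean `ok` flag, and the rule "the earliest failing step is the onset" are all assumptions made for illustration, standing in for whatever diagnostic signals the actual system derives.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Step:
    """One agent action and its outcome (hypothetical schema)."""
    index: int
    action: str
    ok: bool  # assumed stand-in for a real diagnostic signal


@dataclass
class Stage:
    """A named group of consecutive steps (hypothetical schema)."""
    name: str
    steps: List[Step] = field(default_factory=list)


@dataclass
class TraceTree:
    """Two-level tree: a run contains stages, and each stage contains steps."""
    stages: List[Stage] = field(default_factory=list)

    def flat_steps(self) -> List[Tuple[str, Step]]:
        """Return (stage name, step) pairs in execution order."""
        return [(stage.name, step) for stage in self.stages for step in stage.steps]

    def locate_failure_onset(self) -> Optional[Tuple[str, int]]:
        """Return (stage name, step index) of the earliest failing step, if any."""
        for stage_name, step in self.flat_steps():
            if not step.ok:
                return stage_name, step.index
        return None

    def downstream_chain(self) -> List[int]:
        """Return the indices of every step executed after the failure onset."""
        flat = self.flat_steps()
        for pos, (_, step) in enumerate(flat):
            if not step.ok:
                return [later.index for _, later in flat[pos + 1:]]
        return []


# A toy run: the agent patches the wrong file, then loops on the same mistake.
example = TraceTree(stages=[
    Stage("explore", [Step(0, "list files", True), Step(1, "read test log", True)]),
    Stage("edit", [Step(2, "patch wrong file", False), Step(3, "rerun tests", False)]),
    Stage("verify", [Step(4, "retry same patch", False)]),
])

print(example.locate_failure_onset())
print(example.downstream_chain())
```

In this toy run the onset is reported at both levels, as a stage name and a step index, which mirrors the two levels of supervision the abstract describes for CodeTraceBench. A real system would need a far richer notion of failure than a single boolean per step.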

