CodeTracer: Towards Traceable Agent States
April 13, 2026
Authors: Han Li, Yifan Yao, Letian Zhu, Rili Feng, Hongyi Ye, Jiaming Wang, Yancheng He, Pengyu Zou, Lehan Zhang, Xinping Lei, Haoyang Huang, Ken Deng, Ming Sun, Zhaoxiang Zhang, He Ye, Jiaheng Liu
cs.AI
Abstract
Code agents are advancing rapidly, but debugging them is becoming increasingly difficult. As frameworks orchestrate parallel tool calls and multi-stage workflows over complex tasks, the agent's state transitions and error propagation become hard to observe. In these runs, an early misstep can trap the agent in unproductive loops or even cascade into fundamental errors, forming hidden error chains that make it hard to tell when the agent goes off track and why. Existing agent tracing analyses either focus on simple interactions or rely on small-scale manual inspection, which limits their scalability and usefulness for real coding workflows. We present CodeTracer, a tracing architecture that parses heterogeneous run artifacts through evolving extractors, reconstructs the full state transition history as a hierarchical trace tree with persistent memory, and performs failure onset localization to pinpoint the failure origin and its downstream chain. To enable systematic evaluation, we construct CodeTraceBench from a large collection of executed trajectories generated by four widely used code agent frameworks on diverse code tasks (e.g., bug fixing, refactoring, and terminal interaction), with supervision at both the stage and step levels for failure localization. Experiments show that CodeTracer substantially outperforms direct prompting and lightweight baselines, and that replaying its diagnostic signals consistently recovers originally failed runs under matched budgets. Our code and data are publicly available.
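To make the core idea concrete, the following is a minimal sketch of a hierarchical trace tree with failure onset localization. The abstract does not specify CodeTracer's actual data model or APIs, so every class, field, and stage name here is a hypothetical illustration: stages contain steps, and the "failure onset" is taken to be the earliest failing step in execution order, with all later steps treated as its downstream chain.

```python
# Illustrative sketch only: the real CodeTracer data model is not given in the
# abstract; all names (TraceNode, failure_onset, stage/step labels) are invented.
from dataclasses import dataclass, field


@dataclass
class TraceNode:
    """One node in a hierarchical trace tree: a stage or an individual step."""
    name: str
    status: str = "ok"  # "ok" or "fail"
    children: list["TraceNode"] = field(default_factory=list)


def flatten(node: TraceNode):
    """Depth-first walk yielding leaf steps in execution order."""
    if not node.children:
        yield node
    for child in node.children:
        yield from flatten(child)


def failure_onset(root: TraceNode):
    """Return the first failing step (the failure origin) and the
    downstream chain of steps executed after it."""
    steps = list(flatten(root))
    for i, step in enumerate(steps):
        if step.status == "fail":
            return step, steps[i + 1:]
    return None, []


# Toy run: the first step of the second stage fails and drags later steps down.
run = TraceNode("run", children=[
    TraceNode("locate_bug", children=[TraceNode("read_file")]),
    TraceNode("apply_patch", children=[
        TraceNode("edit_file", status="fail"),
        TraceNode("run_tests", status="fail"),
    ]),
])
origin, chain = failure_onset(run)
print(origin.name, [s.name for s in chain])  # → edit_file ['run_tests']
```

The earliest-failure heuristic shown here is deliberately simple; the paper's method additionally reconstructs persistent memory and distinguishes stage-level from step-level supervision, which this sketch does not model.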