CodeTracer: Towards Traceable Agent States
April 13, 2026
Authors: Han Li, Yifan Yao, Letian Zhu, Rili Feng, Hongyi Ye, Jiaming Wang, Yancheng He, Pengyu Zou, Lehan Zhang, Xinping Lei, Haoyang Huang, Ken Deng, Ming Sun, Zhaoxiang Zhang, He Ye, Jiaheng Liu
cs.AI
Abstract
Code agents are advancing rapidly, but debugging them is becoming increasingly difficult. As frameworks orchestrate parallel tool calls and multi-stage workflows over complex tasks, the agent's state transitions and error propagation become hard to observe. In these runs, an early misstep can trap the agent in unproductive loops or even cascade into fundamental errors, forming hidden error chains that make it hard to tell when the agent goes off track and why. Existing agent tracing analyses either focus on simple interactions or rely on small-scale manual inspection, which limits their scalability and usefulness for real coding workflows. We present CodeTracer, a tracing architecture that parses heterogeneous run artifacts through evolving extractors, reconstructs the full state transition history as a hierarchical trace tree with persistent memory, and performs failure onset localization to pinpoint the failure origin and its downstream chain. To enable systematic evaluation, we construct CodeTraceBench from a large collection of executed trajectories generated by four widely used code agent frameworks on diverse code tasks (e.g., bug fixing, refactoring, and terminal interaction), with supervision at both the stage and step levels for failure localization. Experiments show that CodeTracer substantially outperforms direct prompting and lightweight baselines, and that replaying its diagnostic signals consistently recovers originally failed runs under matched budgets. Our code and data are publicly available.
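To make the idea of a hierarchical trace tree with failure onset localization concrete, here is a minimal sketch. It is not CodeTracer's implementation: the `TraceNode` class, the `ok` flag, and both helper functions are hypothetical names chosen for illustration. It also assumes each step already carries a success label, whereas producing that judgment is the hard part the abstract describes, and it treats every step executed after the onset as the downstream chain, which is a simplification.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple

Path = Tuple[str, ...]


@dataclass
class TraceNode:
    """Hypothetical trace-tree node: a stage (has children) or a step (leaf)."""
    name: str
    ok: bool = True  # assumed per-step success label; not from the paper
    children: List["TraceNode"] = field(default_factory=list)


def flatten_steps(node: TraceNode, path: Path = ()) -> Iterator[Tuple[Path, TraceNode]]:
    """Yield (path, leaf) pairs in execution order (depth-first, left to right)."""
    here = path + (node.name,)
    if not node.children:
        yield here, node
        return
    for child in node.children:
        yield from flatten_steps(child, here)


def localize_failure_onset(root: TraceNode) -> Tuple[Optional[Path], List[Path]]:
    """Return the path of the first failing step and the paths of all later steps."""
    steps = list(flatten_steps(root))
    for i, (path, leaf) in enumerate(steps):
        if not leaf.ok:
            return path, [p for p, _ in steps[i + 1:]]
    return None, []  # no failing step found


# Toy run with three stages; the patch step is where things first go wrong.
root = TraceNode("run", children=[
    TraceNode("explore", children=[TraceNode("read_file")]),
    TraceNode("edit", children=[
        TraceNode("apply_patch", ok=False),
        TraceNode("run_tests", ok=False),
    ]),
    TraceNode("verify", children=[TraceNode("rerun_tests", ok=False)]),
])

onset, chain = localize_failure_onset(root)
print("onset:", " / ".join(onset))
for p in chain:
    print("downstream:", " / ".join(p))
```

In this toy run the onset is reported at the stage level (`edit`) and the step level (`apply_patch`) at once, because each path records the enclosing stages. This mirrors the stage-level and step-level supervision mentioned in the abstract only in spirit.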