CodeTracer: Towards Traceable Agent States

Abstract

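Code agents are advancing rapidly, but debugging them is becoming increasingly difficult. As frameworks orchestrate parallel tool calls and multi-stage workflows for complex tasks, the agent's state transitions and error propagation have become hard to observe. In these runs, a small early misstep can trap the agent in an unproductive loop or set off a chain reaction that escalates into fundamental errors, forming hidden error chains that make it hard to tell when and why the agent went off track. Existing agent trace analysis methods either focus on simple interactions or rely on small-scale manual inspection, which limits their scalability and usefulness for real coding workflows. We present CodeTracer, a tracing architecture that parses heterogeneous run artifacts with evolving extractors, reconstructs the full state transition history as a hierarchical trace tree with persistent memory, and performs failure onset localization to pinpoint where a failure begins and to identify its downstream chain. For systematic evaluation, we construct CodeTraceBench from a large collection of execution trajectories generated by four widely used code agent frameworks on diverse code tasks (e.g., bug fixing, refactoring, and terminal interaction), with supervision at both the stage and step levels for failure localization. Experiments show that CodeTracer substantially outperforms direct prompting and lightweight baselines, and that replaying its diagnostic signals consistently recovers originally failed runs under the same budget. Our code and data are publicly available.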

English

Code agents are advancing rapidly, but debugging them is becoming increasingly difficult. As frameworks orchestrate parallel tool calls and multi-stage workflows over complex tasks, the agent's state transitions and error propagation become hard to observe. In these runs, an early misstep can trap the agent in unproductive loops or even cascade into fundamental errors, forming hidden error chains that make it hard to tell when the agent goes off track and why. Existing agent tracing analyses either focus on simple interactions or rely on small-scale manual inspection, which limits their scalability and usefulness for real coding workflows. We present CodeTracer, a tracing architecture that parses heterogeneous run artifacts through evolving extractors, reconstructs the full state transition history as a hierarchical trace tree with persistent memory, and performs failure onset localization to pinpoint the failure origin and its downstream chain. To enable systematic evaluation, we construct CodeTraceBench from a large collection of executed trajectories generated by four widely used code agent frameworks on diverse code tasks (e.g., bug fixing, refactoring, and terminal interaction), with supervision at both the stage and step levels for failure localization. Experiments show that CodeTracer substantially outperforms direct prompting and lightweight baselines, and that replaying its diagnostic signals consistently recovers originally failed runs under matched budgets. Our code and data are publicly available.
