TRAIL: Trace Reasoning and Agentic Issue Localization
May 13, 2025
作者: Darshan Deshpande, Varun Gangal, Hersh Mehta, Jitin Krishnan, Anand Kannappan, Rebecca Qian
cs.AI
Abstract
The increasing adoption of agentic workflows across diverse domains brings a
critical need to scalably and systematically evaluate the complex traces these
systems generate. Current evaluation methods depend on manual, domain-specific
human analysis of lengthy workflow traces - an approach that does not scale
with the growing complexity and volume of agentic outputs. Error analysis in
these settings is further complicated by the interplay of external tool outputs
and language model reasoning, making it more challenging than traditional
software debugging. In this work, we (1) articulate the need for robust and
dynamic evaluation methods for agentic workflow traces, (2) introduce a formal
taxonomy of error types encountered in agentic systems, and (3) present a set
of 148 large human-annotated traces (TRAIL) constructed using this taxonomy and
grounded in established agentic benchmarks. To ensure ecological validity, we
curate traces from both single and multi-agent systems, focusing on real-world
applications such as software engineering and open-world information retrieval.
Our evaluations reveal that modern long-context LLMs perform poorly at trace
debugging, with the best-performing model, Gemini-2.5-pro, scoring a mere 11% on TRAIL. Our
dataset and code are made publicly available to support and accelerate future
research in scalable evaluation for agentic workflows.
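The error-localization task the abstract describes can be pictured as a matching problem: each trace carries human annotations marking where an error occurs and what kind it is, and a judge model is scored on how many of those annotations it recovers. The minimal sketch below is illustrative only; the category names, the `ErrorAnnotation` structure, and the exact-match scoring rule are assumptions for exposition, not TRAIL's actual taxonomy or metric.

```python
from dataclasses import dataclass

# Hypothetical error categories for illustration; the paper's
# actual taxonomy is more fine-grained.
ERROR_CATEGORIES = {"reasoning_error", "tool_output_misuse", "planning_error"}

@dataclass(frozen=True)
class ErrorAnnotation:
    span_id: str   # the step in the workflow trace where the error occurs
    category: str  # one of ERROR_CATEGORIES

def localization_score(gold: set[ErrorAnnotation],
                       predicted: set[ErrorAnnotation]) -> float:
    """Fraction of annotated errors whose location AND category
    the judge model recovered exactly."""
    if not gold:
        return 1.0 if not predicted else 0.0
    return len(gold & predicted) / len(gold)

gold = {ErrorAnnotation("step-3", "reasoning_error"),
        ErrorAnnotation("step-7", "tool_output_misuse")}
pred = {ErrorAnnotation("step-3", "reasoning_error")}
print(localization_score(gold, pred))  # 0.5 - one of two errors found
```

Under a strict rule like this, a model must pinpoint both the step and the error type to get credit, which helps explain why even strong long-context models score low when traces run to thousands of tokens.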