TRAIL: Trace Reasoning and Agentic Issue Localization
May 13, 2025
Authors: Darshan Deshpande, Varun Gangal, Hersh Mehta, Jitin Krishnan, Anand Kannappan, Rebecca Qian
cs.AI
Abstract
The increasing adoption of agentic workflows across diverse domains brings a
critical need to scalably and systematically evaluate the complex traces these
systems generate. Current evaluation methods depend on manual, domain-specific
human analysis of lengthy workflow traces, an approach that does not scale
with the growing complexity and volume of agentic outputs. Error analysis in
these settings is further complicated by the interplay of external tool outputs
and language model reasoning, making it more challenging than traditional
software debugging. In this work, we (1) articulate the need for robust and
dynamic evaluation methods for agentic workflow traces, (2) introduce a formal
taxonomy of error types encountered in agentic systems, and (3) present a set
of 148 large human-annotated traces (TRAIL) constructed using this taxonomy and
grounded in established agentic benchmarks. To ensure ecological validity, we
curate traces from both single and multi-agent systems, focusing on real-world
applications such as software engineering and open-world information retrieval.
Our evaluations reveal that modern long-context LLMs perform poorly at trace
debugging, with the best-performing model, Gemini-2.5-pro, scoring a mere 11% on TRAIL. Our
dataset and code are made publicly available to support and accelerate future
research in scalable evaluation for agentic workflows.