TRAIL: Trace Reasoning and Agentic Issue Localization
May 13, 2025
Authors: Darshan Deshpande, Varun Gangal, Hersh Mehta, Jitin Krishnan, Anand Kannappan, Rebecca Qian
cs.AI
Abstract
The increasing adoption of agentic workflows across diverse domains brings a
critical need to scalably and systematically evaluate the complex traces these
systems generate. Current evaluation methods depend on manual, domain-specific
human analysis of lengthy workflow traces, an approach that does not scale
with the growing complexity and volume of agentic outputs. Error analysis in
these settings is further complicated by the interplay of external tool outputs
and language model reasoning, making it more challenging than traditional
software debugging. In this work, we (1) articulate the need for robust and
dynamic evaluation methods for agentic workflow traces, (2) introduce a formal
taxonomy of error types encountered in agentic systems, and (3) present a set
of 148 large human-annotated traces (TRAIL) constructed using this taxonomy and
grounded in established agentic benchmarks. To ensure ecological validity, we
curate traces from both single and multi-agent systems, focusing on real-world
applications such as software engineering and open-world information retrieval.
Our evaluations reveal that modern long-context LLMs perform poorly at trace
debugging, with the best-performing model, Gemini-2.5-pro, scoring a mere 11% on TRAIL. Our
dataset and code are made publicly available to support and accelerate future
research in scalable evaluation for agentic workflows.