Beyond the Final Answer: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows

Abstract

Large Language Models (LLMs) are increasingly deployed as autonomous agents that reason, use tools, and act over multiple steps. Yet most hallucination benchmarks still evaluate only the final output and so miss failures that originate in intermediate Thought-Action-Observation steps. We present Trajel, a dataset and evaluation framework for auditing trajectory-level hallucinations in multi-agent industrial workflows. Built on expert-annotated agent traces from AssetOpsBench, Trajel introduces a five-type hallucination taxonomy: factual, referential, logical, procedural, and scope-based. We benchmark supervised detection models at the subtask, trajectory, and long-context levels. Our results show that the most common failure modes are missed by existing benchmarks, that nearly half of the trajectories containing hallucinations involve multiple types at once, and that automated detectors with high binary accuracy still misclassify the subtlest types. Trajectory-aware detection significantly outperforms standard post-hoc verification, which indicates that taxonomy-grounded evaluation is essential for safer agentic deployment.