Beyond the Final Answer: Trajectory-Level Hallucination Auditing in Multi-Agent Industrial Workflows

Abstract

Large Language Models (LLMs) are increasingly deployed as autonomous agents that reason, use tools, and act over multiple steps. Yet most hallucination benchmarks still evaluate only the final output and overlook errors that arise in intermediate Thought-Action-Observation steps. We present Trajel, a dataset and evaluation framework for auditing trajectory-level hallucinations in multi-agent industrial workflows. Built on expert-annotated agent traces from AssetOpsBench, Trajel introduces a taxonomy of five hallucination types: factual, referential, logical, procedural, and scope-based. We benchmark supervised detection models at the subtask, trajectory, and long-context levels. Our results show that existing benchmarks miss the most common failure modes, that nearly half of hallucinated trajectories exhibit multiple types at once, and that even automated detectors with high binary accuracy misclassify the subtlest types. Trajectory-aware detection significantly outperforms standard post-hoc verification, suggesting that taxonomy-grounded evaluation is essential for safer agent deployment.