From Features to Actions: Explainability in Traditional and Agentic AI Systems

Abstract

Over the last decade, explainable AI has primarily focused on interpreting individual model predictions, producing post-hoc explanations that relate inputs to outputs under a fixed decision structure. Recent advances in large language models (LLMs) have enabled agentic AI systems whose behaviour unfolds over multi-step trajectories. In these settings, success and failure are determined by sequences of decisions rather than a single output. Although explanation approaches designed for static predictions are useful, it remains unclear how they translate to agentic settings where behaviour emerges over time. In this work, we bridge the gap between static and agentic explainability by comparing attribution-based explanations with trace-based diagnostics across both settings. Concretely, we empirically compare attribution-based explanations used in static classification tasks with trace-based diagnostics used in agentic benchmarks (TAU-bench Airline and AssistantBench). Our results show that while attribution methods achieve stable feature rankings in static settings (Spearman ρ = 0.86), they cannot be applied reliably to diagnose execution-level failures in agentic trajectories. In contrast, trace-grounded rubric evaluation for agentic settings consistently localizes behaviour breakdowns and reveals that state tracking inconsistency is 2.7 times more prevalent in failed runs and reduces success probability by 49%. These findings motivate a shift towards trajectory-level explainability for agentic systems when evaluating and diagnosing autonomous AI behaviour.

Resources:
https://github.com/VectorInstitute/unified-xai-evaluation-framework
https://vectorinstitute.github.io/unified-xai-evaluation-framework
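The two kinds of statistic quoted in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the run-record layout (`success`, `flags`), the flag label `state_tracking_inconsistency`, and all numbers are hypothetical toy values chosen for illustration, and the prevalence ratio shown is only one plausible reading of "more prevalent in failed runs".

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman_rho(a, b):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


def prevalence_ratio(runs, flag):
    """P(flag | failed run) / P(flag | successful run).

    Assumes both groups are non-empty and the flag occurs in at least one
    successful run; otherwise the ratio is undefined.
    """
    failed = [r for r in runs if not r["success"]]
    passed = [r for r in runs if r["success"]]
    p_failed = sum(flag in r["flags"] for r in failed) / len(failed)
    p_passed = sum(flag in r["flags"] for r in passed) / len(passed)
    return p_failed / p_passed


# Toy attribution scores for the same five features from two explanation runs.
scores_run_1 = [0.40, 0.25, 0.20, 0.10, 0.05]
scores_run_2 = [0.35, 0.30, 0.10, 0.20, 0.05]

# Toy agent runs, each annotated with hypothetical rubric flags.
FLAG = "state_tracking_inconsistency"
toy_runs = (
    [{"success": False, "flags": {FLAG}}] * 3
    + [{"success": False, "flags": set()}]
    + [{"success": True, "flags": {FLAG}}]
    + [{"success": True, "flags": set()}] * 3
)

if __name__ == "__main__":
    print("Spearman rho:", spearman_rho(scores_run_1, scores_run_2))
    print("Prevalence ratio:", prevalence_ratio(toy_runs, FLAG))
```

With these toy inputs, the two rankings differ by one adjacent swap, so the rank correlation is 0.9. The flag appears in three of four failed runs and one of four successful runs, so the prevalence ratio is 3.0. Neither value is related to the paper's reported figures.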

