From Features to Actions: Explainability in Traditional and Agentic AI Systems
February 6, 2026
Authors: Sindhuja Chaduvula, Jessee Ho, Kina Kim, Aravind Narayanan, Mahshid Alinoori, Muskan Garg, Dhanesh Ramachandram, Shaina Raza
cs.AI
Abstract
Over the last decade, explainable AI has primarily focused on interpreting individual model predictions, producing post-hoc explanations that relate inputs to outputs under a fixed decision structure. Recent advances in large language models (LLMs) have enabled agentic AI systems whose behaviour unfolds over multi-step trajectories. In these settings, success and failure are determined by sequences of decisions rather than a single output. While such explanation methods remain useful, it is unclear how approaches designed for static predictions translate to agentic settings where behaviour emerges over time. In this work, we bridge the gap between static and agentic explainability by comparing attribution-based explanations with trace-based diagnostics across both settings. To make this distinction explicit, we empirically compare attribution-based explanations used in static classification tasks with trace-based diagnostics used in agentic benchmarks (TAU-bench Airline and AssistantBench). Our results show that while attribution methods achieve stable feature rankings in static settings (Spearman ρ = 0.86), they cannot be applied reliably to diagnose execution-level failures in agentic trajectories. In contrast, trace-grounded rubric evaluation in agentic settings consistently localizes behaviour breakdowns and reveals that state-tracking inconsistency is 2.7 times more prevalent in failed runs and reduces success probability by 49%. These findings motivate a shift towards trajectory-level explainability for agentic systems when evaluating and diagnosing autonomous AI behaviour.
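The ranking-stability comparison in the abstract can be illustrated with a minimal sketch: given attribution scores from two runs over the same features, rank features by importance and compute Spearman's ρ between the rankings. The feature scores and helper names below are illustrative assumptions, not the paper's actual data or code.

```python
# Hypothetical sketch: measuring attribution-ranking stability with
# Spearman's rho. The two score vectors are made-up examples, standing
# in for attributions from two runs over the same five features.

def spearman_rho(ranks_a, ranks_b):
    """Spearman rank correlation for two rankings without ties."""
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d2 / (n * (n**2 - 1))

def to_ranks(scores):
    """Rank features by descending |attribution| (rank 1 = most important)."""
    order = sorted(range(len(scores)), key=lambda i: -abs(scores[i]))
    ranks = [0] * len(scores)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

# Two attribution runs over the same five features (e.g. re-seeded runs)
run_a = [0.42, -0.31, 0.10, 0.05, -0.02]
run_b = [0.39, -0.35, 0.08, -0.04, 0.06]

rho = spearman_rho(to_ranks(run_a), to_ranks(run_b))
print(f"Spearman rho = {rho:.2f}")  # → Spearman rho = 0.90
```

A ρ near 1, as in the paper's static-setting result (0.86), indicates that the attribution method orders features consistently across runs.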
Resources:
https://github.com/VectorInstitute/unified-xai-evaluation-framework
https://vectorinstitute.github.io/unified-xai-evaluation-framework
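To make the idea of trace-grounded rubric evaluation concrete, here is a minimal sketch in which each rubric criterion inspects the whole trajectory rather than a single output. The criterion names, the `Step` structure, and the toy trace are hypothetical illustrations; the paper's actual rubric and trace format may differ.

```python
# Hypothetical sketch of trace-grounded rubric scoring: criteria run over
# the full trajectory, so execution-level failures (like the state-tracking
# inconsistencies the abstract highlights) can be localized to a step.

from dataclasses import dataclass, field

@dataclass
class Step:
    action: str                                 # e.g. a tool call the agent issued
    state: dict = field(default_factory=dict)   # agent's recorded state after the step

def state_tracking_consistent(trace):
    """Flag runs where a later step contradicts earlier recorded state."""
    seen = {}
    for step in trace:
        for key, value in step.state.items():
            if key in seen and seen[key] != value and step.action != f"update:{key}":
                return False
            seen[key] = value
    return True

def completed_goal(trace, goal_action):
    """Did the trajectory ever execute the goal action?"""
    return any(step.action == goal_action for step in trace)

# Illustrative rubric: each entry maps a criterion name to a trace-level check.
RUBRIC = {
    "state_tracking": state_tracking_consistent,
    "goal_reached": lambda t: completed_goal(t, "confirm_booking"),
}

# Toy failed run: the flight changes silently without an update action.
trace = [
    Step("search_flights", {"flight": "AA100"}),
    Step("check_seats", {"flight": "AA200"}),   # state-tracking inconsistency
    Step("confirm_booking", {"flight": "AA200"}),
]

scores = {name: check(trace) for name, check in RUBRIC.items()}
print(scores)  # → {'state_tracking': False, 'goal_reached': True}
```

The point of the design is that the run "succeeds" at its final action yet fails the state-tracking criterion, which is exactly the kind of execution-level breakdown that single-output attribution cannot surface.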