From Features to Actions: Explainability in Traditional and Agentic AI Systems
February 6, 2026
Authors: Sindhuja Chaduvula, Jessee Ho, Kina Kim, Aravind Narayanan, Mahshid Alinoori, Muskan Garg, Dhanesh Ramachandram, Shaina Raza
cs.AI
Abstract
Over the last decade, explainable AI has primarily focused on interpreting individual model predictions, producing post-hoc explanations that relate inputs to outputs under a fixed decision structure. Recent advances in large language models (LLMs) have enabled agentic AI systems whose behaviour unfolds over multi-step trajectories. In these settings, success and failure are determined by sequences of decisions rather than a single output. While existing explanation methods remain useful, it is unclear how approaches designed for static predictions translate to agentic settings where behaviour emerges over time. In this work, we bridge the gap between static and agentic explainability by comparing attribution-based explanations with trace-based diagnostics across both settings. To make this distinction explicit, we empirically compare attribution-based explanations on static classification tasks with trace-based diagnostics on agentic benchmarks (TAU-bench Airline and AssistantBench). Our results show that while attribution methods achieve stable feature rankings in static settings (Spearman ρ = 0.86), they cannot be applied reliably to diagnose execution-level failures in agentic trajectories. In contrast, trace-grounded rubric evaluation for agentic settings consistently localizes behaviour breakdowns and reveals that state-tracking inconsistency is 2.7 times more prevalent in failed runs and reduces success probability by 49%. These findings motivate a shift towards trajectory-level explainability when evaluating and diagnosing autonomous AI behaviour.
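The "stable feature rankings" claim can be made concrete with a small sketch. The snippet below (our illustration, not the paper's code) rank-correlates feature-attribution scores from two runs of the same static classifier; a high Spearman ρ means both explanations order the features the same way. The feature names and scores are hypothetical.

```python
def rank(values):
    """Assign average ranks (1 = smallest), handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a block of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = rank(xs), rank(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical attribution scores for the same five features from two runs.
run_a = {"age": 0.41, "income": 0.33, "tenure": 0.15, "region": 0.07, "channel": 0.04}
run_b = {"age": 0.39, "income": 0.30, "tenure": 0.18, "region": 0.05, "channel": 0.08}

features = sorted(run_a)
rho = spearman_rho([run_a[f] for f in features], [run_b[f] for f in features])
print(f"Spearman rho = {rho:.2f}")  # high rho => the two explanations agree on ordering
```

Note that a stable ranking like this says nothing about *when* during a multi-step trajectory a decision went wrong, which is why the paper pairs it with trace-based diagnostics.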
Resources:
https://github.com/VectorInstitute/unified-xai-evaluation-framework
https://vectorinstitute.github.io/unified-xai-evaluation-framework
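For the agentic side, a trace-grounded rubric check can be sketched as a scan over a recorded trajectory. The example below is our hedged illustration of one rubric item, state-tracking consistency: the agent's stated belief about a key (here, a flight booking status) must match the most recent tool observation. The trajectory format and step fields are assumptions for this sketch, not the framework's actual schema.

```python
def state_tracking_violations(trace, key):
    """Return step indices where the agent asserts a value for `key`
    that contradicts the most recent tool observation of `key`."""
    observed = None
    violations = []
    for i, step in enumerate(trace):
        if step["role"] == "tool" and key in step.get("state", {}):
            observed = step["state"][key]  # ground truth from the environment
        elif step["role"] == "agent" and key in step.get("claims", {}):
            if observed is not None and step["claims"][key] != observed:
                violations.append(i)  # agent's belief diverged from the trace
    return violations

# Toy trajectory: the tool reports the booking as "cancelled", but the agent
# later still claims it is "confirmed" -- a state-tracking inconsistency.
trace = [
    {"role": "agent", "claims": {"booking": "confirmed"}},
    {"role": "tool",  "state":  {"booking": "cancelled"}},
    {"role": "agent", "claims": {"booking": "confirmed"}},  # violation at step 2
]
print(state_tracking_violations(trace, "booking"))  # -> [2]
```

Unlike a feature-attribution score, this kind of check localizes the failure to a specific step in the trajectory, which is the behaviour-breakdown localization the abstract describes.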