ChatPaper.ai


Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks

April 20, 2026
作者: Rongyuan Tan, Jue Zhang, Zhuozhao Li, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
cs.AI

Abstract

Interpretability tools are increasingly used to analyze failures of Large Language Models (LLMs), yet prior work largely focuses on short prompts or toy settings, leaving their behavior on commonly used benchmarks underexplored. To address this gap, we study contrastive, LRP-based attribution as a practical tool for analyzing LLM failures in realistic settings. We formulate failure analysis as contrastive attribution, attributing the logit difference between an incorrect output token and a correct alternative to input tokens and internal model states, and introduce an efficient extension that enables construction of cross-layer attribution graphs for long-context inputs. Using this framework, we conduct a systematic empirical study across benchmarks, comparing attribution patterns across datasets, model sizes, and training checkpoints. Our results show that this token-level contrastive attribution can yield informative signals in some failure cases, but is not universally applicable, highlighting both its utility and its limitations for realistic LLM failure analysis. Our code is available at: https://aka.ms/Debug-XAI.
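The abstract formulates failure analysis as attributing the logit difference between an incorrect output token and a correct alternative back to the input. The sketch below illustrates that contrastive objective on a toy linear "unembedding" layer, where a gradient-times-input decomposition is exact and satisfies the conservation property that LRP-style rules aim for. All names here (`W`, `x`, the token indices) are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 5, 8
W = rng.normal(size=(vocab, dim))    # toy unembedding matrix
x = rng.normal(size=dim)             # final hidden state at the failure position

logits = W @ x
wrong, correct = 2, 4                # hypothetical incorrect / correct token ids
s = logits[wrong] - logits[correct]  # contrastive target: the logit difference

# For a linear map, gradient-times-input decomposes the target exactly:
# each input dimension receives (dW) * x, and the attributions sum to s.
attr = (W[wrong] - W[correct]) * x

assert np.isclose(attr.sum(), s)     # conservation: attributions account for s
print("logit diff:", float(s))
```

In a real LLM the map from inputs to logits is nonlinear, which is why the paper uses LRP-based propagation rules rather than plain gradients; this example only shows the contrastive target being decomposed, not the propagation scheme itself.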