

Revisiting LRP: Positional Attribution as the Missing Ingredient for Transformer Explainability

June 2, 2025
Authors: Yarden Bakish, Itamar Zimerman, Hila Chefer, Lior Wolf
cs.AI

Abstract

The development of effective explainability tools for Transformers is a crucial pursuit in deep learning research. One of the most promising approaches in this domain is Layer-wise Relevance Propagation (LRP), which propagates relevance scores backward through the network to the input space by redistributing activation values based on predefined rules. However, existing LRP-based methods for Transformer explainability entirely overlook a critical component of the Transformer architecture: its positional encoding (PE). This results in a violation of the conservation property and the loss of an important and unique type of relevance, one that is also associated with structural and positional features. To address this limitation, we reformulate the input space for Transformer explainability as a set of position-token pairs. This allows us to propose specialized, theoretically grounded LRP rules designed to propagate attributions across various positional encoding methods, including Rotary, Learnable, and Absolute PE. Extensive experiments with both fine-tuned classifiers and zero-shot foundation models, such as LLaMA 3, demonstrate that our method significantly outperforms the state-of-the-art in both vision and NLP explainability tasks. Our code is publicly available.
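To make the relevance-redistribution idea concrete, the sketch below shows two standard LRP-style steps in PyTorch: an epsilon rule for a linear layer, and a proportional split of the relevance arriving at an embedding into its token and positional parts under an additive PE (learnable or absolute). This is a minimal illustration under assumed conventions, not the paper's released code; the function names and the epsilon stabilizer are illustrative, and the paper's dedicated rules (in particular for Rotary PE) go beyond this additive case.

```python
import torch


def lrp_epsilon_linear(layer: torch.nn.Linear,
                       x: torch.Tensor,
                       relevance: torch.Tensor,
                       eps: float = 1e-6) -> torch.Tensor:
    """LRP-epsilon rule for a linear layer: redistribute the relevance of each
    output neuron to the inputs in proportion to their contributions z_ij = x_i * w_ij."""
    x = x.clone().detach().requires_grad_(True)
    z = layer(x)                              # forward contributions (plus bias)
    s = relevance / (z + eps * z.sign())      # stabilized relevance-to-activation ratio
    (z * s.detach()).sum().backward()         # gradient trick: x.grad_i = sum_j w_ij * s_j
    return (x * x.grad).detach()              # R_i = x_i * sum_j w_ij * s_j


def split_relevance_additive_pe(tok_emb: torch.Tensor,
                                pos_emb: torch.Tensor,
                                relevance: torch.Tensor,
                                eps: float = 1e-6):
    """For additive PE, h = tok_emb + pos_emb. Split the relevance arriving at h
    between the token and the position in proportion to each addend, so that
    r_tok + r_pos approximately conserves the incoming relevance."""
    h = tok_emb + pos_emb
    denom = h + eps * h.sign()
    r_tok = relevance * tok_emb / denom
    r_pos = relevance * pos_emb / denom
    return r_tok, r_pos
```

In this sketch, treating the input as position-token pairs simply means keeping `r_pos` as a separate attribution map instead of discarding it, so the total relevance entering the embedding layer is conserved across both components.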