Revisiting LRP: Positional Attribution as the Missing Ingredient for Transformer Explainability

Abstract

The development of effective explainability tools for Transformers is a crucial pursuit in deep learning research. One of the most promising approaches in this domain is Layer-wise Relevance Propagation (LRP), which propagates relevance scores backward through the network to the input space by redistributing activation values according to predefined rules. However, existing LRP-based methods for Transformer explainability entirely overlook a critical component of the Transformer architecture: its positional encoding (PE). This omission violates the conservation property and discards an important and distinct type of relevance, one associated with structural and positional features. To address this limitation, we reformulate the input space for Transformer explainability as a set of position-token pairs. This reformulation allows us to propose specialized, theoretically grounded LRP rules that propagate attributions across various positional encoding methods, including Rotary, Learnable, and Absolute PE. Extensive experiments with both fine-tuned classifiers and zero-shot foundation models, such as LLaMA 3, demonstrate that our method significantly outperforms the state of the art in both vision and NLP explainability tasks. Our code is publicly available.
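The abstract rests on two LRP ideas: relevance is redistributed layer by layer according to a rule, and its total is conserved. The toy sketch below (pure Python standard library) illustrates both under stated assumptions. It applies the standard LRP-ε rule to a single linear layer and, for an additive positional encoding (x = token embedding + position embedding, as with absolute or learnable PE), splits each input's relevance between the two summands in proportion to their values. This proportional split is a generic illustration chosen for this example, not the specialized rules proposed in the paper, and it does not cover Rotary PE, which acts multiplicatively inside attention. The function names (`linear`, `stabilize`, `lrp_epsilon`, `split_additive`) and all numbers are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]  # weights[j][i] connects input i to output j


def linear(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """Forward pass: z_j = sum_i w_ji * x_i + b_j."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def stabilize(value: float, eps: float) -> float:
    """Push value away from zero by eps (zero counts as positive)."""
    return value + (eps if value >= 0 else -eps)


def lrp_epsilon(weights: Matrix, bias: Vector, x: Vector,
                relevance_out: Vector, eps: float = 1e-9) -> Vector:
    """LRP-epsilon: R_i = sum_j x_i * w_ji / (z_j + eps * sign(z_j)) * R_j."""
    z = linear(weights, bias, x)
    relevance_in = [0.0] * len(x)
    for j, row in enumerate(weights):
        denom = stabilize(z[j], eps)
        for i, w in enumerate(row):
            relevance_in[i] += x[i] * w / denom * relevance_out[j]
    return relevance_in


def split_additive(token: Vector, position: Vector, relevance_x: Vector,
                   eps: float = 1e-9) -> Tuple[Vector, Vector]:
    """Share R(x_i) between the summands of x_i = t_i + p_i proportionally."""
    r_tok, r_pos = [], []
    for t, p, r in zip(token, position, relevance_x):
        denom = stabilize(t + p, eps)
        r_tok.append(t / denom * r)
        r_pos.append(p / denom * r)
    return r_tok, r_pos


# Toy example: one embedding vector, one linear layer, zero bias.
token = [1.0, -0.5, 2.0]        # hypothetical token embedding
position = [0.5, 1.5, -1.0]     # hypothetical absolute/learnable PE vector
x = [t + p for t, p in zip(token, position)]   # [1.5, 1.0, 1.0]
weights = [[1.0, 2.0, -1.0], [0.5, -1.0, 1.0]]
bias = [0.0, 0.0]

z = linear(weights, bias, x)          # [2.5, 0.75]
relevance_out = [z[0], 0.0]           # explain output 0 only
relevance_x = lrp_epsilon(weights, bias, x, relevance_out)
r_tok, r_pos = split_additive(token, position, relevance_x)

print(round(sum(relevance_x), 6))                   # 2.5
print(round(sum(r_tok), 6), round(sum(r_pos), 6))   # -2.0 4.5
```

In this toy case the relevance reaching the input sums to the explained output (2.5), but the token summands alone account for only -2.0; the remaining 4.5 sits on the positional summand. Attributing relevance to tokens alone would therefore break conservation here, a small analogue of the gap the abstract describes.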
