AtP*: An efficient and scalable method for localizing LLM behaviour to components

Abstract

Activation Patching is a method of directly computing causal attributions of behavior to model components. However, applying it exhaustively requires a sweep whose cost scales linearly with the number of model components, which can be prohibitively expensive for state-of-the-art Large Language Models (LLMs). We investigate Attribution Patching (AtP), a fast gradient-based approximation to Activation Patching, and find two classes of failure modes that lead to significant false negatives. We propose a variant of AtP called AtP*, with two changes that address these failure modes while retaining scalability. We present the first systematic study of AtP and alternative methods for faster activation patching, and show that AtP significantly outperforms all other investigated methods, with AtP* providing further significant improvement. Finally, we provide a method to bound the probability of remaining false negatives of AtP* estimates.
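
To make the contrast between exhaustive patching and its gradient-based approximation concrete, the following is a minimal sketch on a toy model, not the paper's implementation. Everything in it is an assumption made for illustration: the "model" is a weighted sum of tanh over three node activations, the helper names (`model`, `grad`, `activation_patching`, `attribution_patching`) are hypothetical, and the gradient is written out analytically rather than obtained from an autodiff library. The exact sweep needs one forward pass per node, while the first-order estimate reuses a single gradient for all nodes.

```python
import math

def model(h, w):
    """Toy metric (assumed for illustration): weighted sum of tanh over node activations."""
    return sum(wi * math.tanh(hi) for hi, wi in zip(h, w))

def grad(h, w):
    """Analytic gradient of model() with respect to each node activation."""
    return [wi * (1.0 - math.tanh(hi) ** 2) for hi, wi in zip(h, w)]

def activation_patching(h_clean, h_noise, w):
    """Exact effect of patching each node: one forward pass per node."""
    base = model(h_clean, w)
    effects = []
    for i in range(len(h_clean)):
        patched = list(h_clean)
        patched[i] = h_noise[i]  # swap in the activation from the other input
        effects.append(model(patched, w) - base)
    return effects

def attribution_patching(h_clean, h_noise, w):
    """First-order estimate for every node from a single gradient at the clean input."""
    g = grad(h_clean, w)
    return [(hn - hc) * gi for hc, hn, gi in zip(h_clean, h_noise, g)]

# Node 1 sits in the flat (saturated) region of tanh on the clean input.
h_clean = [0.1, 6.0, -0.2]
h_noise = [0.3, -6.0, 0.1]
w = [1.0, 1.0, 2.0]

exact = activation_patching(h_clean, h_noise, w)
approx = attribution_patching(h_clean, h_noise, w)
for i, (e, a) in enumerate(zip(exact, approx)):
    print(f"node {i}: exact={e:+.4f}  estimate={a:+.4f}")
```

For nodes 0 and 2 the linear estimate tracks the exact effect closely. For node 1 the exact effect is large while the estimate is close to zero, because the gradient at a saturated activation is nearly flat. This is only a toy picture of how a first-order approximation can produce a false negative; it is not a reproduction of the two failure-mode classes studied in the paper.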
