AtP*: An efficient and scalable method for localizing LLM behaviour to components

Abstract

Activation Patching is a method of directly computing causal attributions of behavior to model components. However, applying it exhaustively requires a sweep whose cost scales linearly with the number of model components, which can be prohibitively expensive for state-of-the-art Large Language Models (LLMs). We investigate Attribution Patching (AtP), a fast gradient-based approximation to Activation Patching, and find two classes of failure modes of AtP that lead to significant false negatives. We propose a variant of AtP called AtP*, with two changes that address these failure modes while retaining scalability. We present the first systematic study of AtP and alternative methods for faster activation patching, and show that AtP significantly outperforms all other investigated methods, with AtP* providing further significant improvement. Finally, we provide a method to bound the probability of remaining false negatives of AtP* estimates.
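To make the contrast between the two methods concrete, the following is a minimal toy sketch, not the paper's implementation or experimental setup. It assumes a hypothetical "model" whose output is a weighted sum of `tanh` applied to three hidden node values; the names `forward`, `activation_patching`, `attribution_patching`, and all numeric values are illustrative assumptions. Activation patching needs one forward pass per node, while the gradient-based estimate multiplies the clean-input gradient by the change in each node's activation.

```python
import numpy as np

def forward(h, w_out):
    """Toy model tail (assumed for illustration): output as a function of node values h."""
    return float(np.sum(w_out * np.tanh(h)))

def activation_patching(h_clean, h_noise, w_out):
    """Exact per-node effect: swap in one noise activation at a time (one forward pass per node)."""
    base = forward(h_clean, w_out)
    effects = np.zeros_like(h_clean)
    for i in range(len(h_clean)):
        h_patched = h_clean.copy()
        h_patched[i] = h_noise[i]
        effects[i] = forward(h_patched, w_out) - base
    return effects

def attribution_patching(h_clean, h_noise, w_out):
    """First-order estimate: gradient at the clean activations times the activation difference."""
    grad = w_out * (1.0 - np.tanh(h_clean) ** 2)  # d(output)/d(h) for this toy model
    return (h_noise - h_clean) * grad

# Hypothetical activations: nodes 0 and 1 change slightly, node 2 flips across a saturated region.
h_clean = np.array([0.1, -0.2, 6.0])
h_noise = np.array([0.15, -0.1, -6.0])
w_out = np.array([1.0, 2.0, 1.0])

exact = activation_patching(h_clean, h_noise, w_out)
approx = attribution_patching(h_clean, h_noise, w_out)
print("exact :", exact)
print("approx:", approx)
```

In this toy setting the two agree closely for the small perturbations on nodes 0 and 1, but for node 2 the gradient at the clean activation is nearly zero, so the linear estimate is close to zero even though the true patching effect is large. This is only one simple way a gradient-based approximation can produce a false negative; it is not claimed to reproduce the specific failure modes the paper identifies.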
