
AtP*: An efficient and scalable method for localizing LLM behaviour to components

March 1, 2024
Authors: János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda
cs.AI

Abstract

Activation Patching is a method of directly computing causal attributions of behavior to model components. However, applying it exhaustively requires a sweep with cost scaling linearly in the number of model components, which can be prohibitively expensive for SoTA Large Language Models (LLMs). We investigate Attribution Patching (AtP), a fast gradient-based approximation to Activation Patching, and find two classes of failure modes of AtP which lead to significant false negatives. We propose a variant of AtP called AtP*, with two changes to address these failure modes while retaining scalability. We present the first systematic study of AtP and alternative methods for faster activation patching, and show that AtP significantly outperforms all other investigated methods, with AtP* providing further significant improvement. Finally, we provide a method to bound the probability of remaining false negatives of AtP* estimates.
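The core contrast in the abstract, exact Activation Patching versus its gradient-based AtP approximation, can be illustrated in a few lines. Below is a minimal, hypothetical PyTorch sketch (not the paper's code): `Toy`, `metric`, `clean_input`, and `patch_input` are illustrative stand-ins, and a single hidden activation plays the role of a model component such as an attention head or MLP output.

```python
# Hypothetical sketch: exact Activation Patching vs. Attribution Patching (AtP)
# on a toy model. All names here are illustrative, not from the paper.
import torch
import torch.nn as nn

torch.manual_seed(0)

class Toy(nn.Module):
    """Toy model; the hidden activation stands in for a model component."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)
        self.fc2 = nn.Linear(8, 1)

    def forward(self, x, patch_act=None):
        act = torch.relu(self.fc1(x))
        if patch_act is not None:
            act = patch_act  # intervene: replace the component's activation
        return self.fc2(act)

model = Toy()
clean_input = torch.randn(1, 4)
patch_input = torch.randn(1, 4)  # counterfactual prompt

def metric(logits):
    return logits.sum()  # stand-in for a scalar behavioral metric

# --- Activation Patching: exact causal effect of patching this component ---
with torch.no_grad():
    clean_act = torch.relu(model.fc1(clean_input))
    patch_act = torch.relu(model.fc1(patch_input))
    clean_out = metric(model(clean_input))
    patched_out = metric(model(clean_input, patch_act=patch_act))
    exact_effect = patched_out - clean_out

# --- Attribution Patching (AtP): first-order Taylor approximation ---
# One forward+backward pass on the clean run gives the gradient of the
# metric w.r.t. the activation; the effect is approximated by a dot product.
act = torch.relu(model.fc1(clean_input))
act.retain_grad()
metric(model.fc2(act)).backward()
atp_estimate = ((patch_act - clean_act) * act.grad).sum()

print(f"exact: {exact_effect.item():.4f}, AtP: {atp_estimate.item():.4f}")
```

The appeal of AtP, as the abstract notes, is cost: exact patching needs a separate forward pass per component (the linear-cost sweep), whereas a single forward and backward pass yields AtP estimates for all components at once. The paper's failure-mode analysis concerns cases where this first-order estimate is badly wrong, producing false negatives.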