

The Hydra Effect: Emergent Self-repair in Language Model Computations

July 28, 2023
Authors: Thomas McGrath, Matthew Rahtz, Janos Kramar, Vladimir Mikulik, Shane Legg
cs.AI

Abstract

We investigate the internal structure of language model computations using causal analysis and demonstrate two motifs: (1) a form of adaptive computation where ablations of one attention layer of a language model cause another layer to compensate (which we term the Hydra effect) and (2) a counterbalancing function of late MLP layers that act to downregulate the maximum-likelihood token. Our ablation studies demonstrate that language model layers are typically relatively loosely coupled (ablations to one layer only affect a small number of downstream layers). Surprisingly, these effects occur even in language models trained without any form of dropout. We analyse these effects in the context of factual recall and consider their implications for circuit-level attribution in language models.
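The central measurement behind these findings is a zero-ablation: knock out one attention layer's contribution to the residual stream and observe how downstream layers respond. Below is a minimal sketch of that style of experiment, not the authors' code. It assumes the Hugging Face GPT-2 implementation (where each attention module returns a tuple whose first element is its residual-stream contribution); the model choice, prompt, and layer indices are illustrative, and projecting a layer's output onto the top token's unembedding direction is only a rough proxy for its direct effect (it ignores the final layer norm).

```python
# Sketch of a Hydra-effect-style ablation probe (illustrative, not the paper's code).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Illustrative factual-recall prompt, in the spirit of the paper's setting.
prompt = "The Eiffel Tower is located in the city of"
inputs = tokenizer(prompt, return_tensors="pt")

captured = {}

def make_capture_hook(name):
    # Record this attention layer's output (its write to the residual stream).
    def hook(module, args, output):
        captured[name] = output[0].detach()
        return output
    return hook

def zero_hook(module, args, output):
    # Zero-ablate this attention layer's contribution to the residual stream.
    return (torch.zeros_like(output[0]),) + output[1:]

ablated, probe = 5, 7  # assumed layer indices: one to ablate, one downstream to watch

# Clean run: record the probe layer's output and the maximum-likelihood token.
h = model.transformer.h[probe].attn.register_forward_hook(make_capture_hook("clean"))
with torch.no_grad():
    clean_logits = model(**inputs).logits[0, -1]
h.remove()
top = clean_logits.argmax()

# Ablated run: knock out the ablated layer, record the probe layer again.
h1 = model.transformer.h[ablated].attn.register_forward_hook(zero_hook)
h2 = model.transformer.h[probe].attn.register_forward_hook(make_capture_hook("ablated"))
with torch.no_grad():
    ablated_logits = model(**inputs).logits[0, -1]
h1.remove()
h2.remove()

# Project the probe layer's final-position output onto the top token's
# unembedding direction; a shift after ablation suggests compensation.
u = model.lm_head.weight[top]
for name, act in captured.items():
    print(f"probe layer {probe} ({name}): {torch.dot(act[0, -1], u).item():.3f}")
print(f"top-token logit: {clean_logits[top].item():.3f} -> {ablated_logits[top].item():.3f}")
```

If the downstream layer is compensating, the total drop in the top-token logit should be much smaller than the ablated layer's own contribution, with the probe layer's projection rising to make up part of the difference.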