The Hydra Effect: Emergent Self-repair in Language Model Computations
July 28, 2023
Authors: Thomas McGrath, Matthew Rahtz, Janos Kramar, Vladimir Mikulik, Shane Legg
cs.AI
Abstract
We investigate the internal structure of language model computations using
causal analysis and demonstrate two motifs: (1) a form of adaptive computation
where ablations of one attention layer of a language model cause another layer
to compensate (which we term the Hydra effect) and (2) a counterbalancing
function of late MLP layers that act to downregulate the maximum-likelihood
token. Our ablation studies demonstrate that language model layers are
typically relatively loosely coupled (ablations to one layer only affect a
small number of downstream layers). Surprisingly, these effects occur even in
language models trained without any form of dropout. We analyse these effects
in the context of factual recall and consider their implications for
circuit-level attribution in language models.
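The ablation methodology described above can be sketched in miniature. The toy model below is an assumption for illustration (it is not the paper's code): the residual stream is treated as a sum of per-layer contributions, "ablating" a layer means zeroing its contribution, and the effect is read off as the change in final logits. In a real trained transformer the Hydra effect would show up here as a downstream layer's contribution changing to compensate; this static sketch only demonstrates the measurement itself.

```python
import numpy as np

# Hypothetical toy setup: a 4-layer model whose layers each write a
# vector into a shared residual stream, followed by an unembedding.
rng = np.random.default_rng(0)
n_layers, d_model, vocab = 4, 8, 10
layer_outputs = rng.normal(size=(n_layers, d_model))  # per-layer residual writes
unembed = rng.normal(size=(d_model, vocab))           # residual stream -> logits

def logits(mask):
    """Final logits, keeping only layers where mask[i] is True."""
    resid = (layer_outputs * np.asarray(mask)[:, None]).sum(axis=0)
    return resid @ unembed

# Ablation study: zero out each layer in turn and measure the logit change.
baseline = logits([True] * n_layers)
for i in range(n_layers):
    mask = [j != i for j in range(n_layers)]
    effect = np.abs(logits(mask) - baseline).max()
    print(f"ablate layer {i}: max logit change = {effect:.3f}")
```

In the paper's setting the interesting quantity is not just this direct effect, but how the contributions of *later* layers shift when an earlier attention layer is ablated, which is what reveals the self-repair.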