The Hydra Effect: Emergent Self-repair in Language Model Computations

Abstract

We investigate the internal structure of language model computations using causal analysis and demonstrate two motifs: (1) a form of adaptive computation in which ablating one attention layer of a language model causes another layer to compensate (which we term the Hydra effect) and (2) a counterbalancing function of late MLP layers, which act to downregulate the maximum-likelihood token. Our ablation studies demonstrate that language model layers are typically relatively loosely coupled (ablating one layer affects only a small number of downstream layers). Surprisingly, these effects occur even in language models trained without any form of dropout. We analyse these effects in the context of factual recall and consider their implications for circuit-level attribution in language models.
