Interpretability at Scale: Identifying Causal Mechanisms in Alpaca

Abstract

Obtaining human-interpretable explanations of large, general-purpose language models is an urgent goal for AI safety. However, it is just as important that our interpretability methods are faithful to the causal dynamics underlying model behavior and able to generalize robustly to unseen inputs. Distributed Alignment Search (DAS) is a powerful gradient descent method grounded in a theory of causal abstraction that has uncovered perfect alignments between interpretable symbolic algorithms and small deep learning models fine-tuned for specific tasks. In the present paper, we scale DAS significantly by replacing the remaining brute-force search steps with learned parameters, an approach we call Boundless DAS. This enables us to efficiently search for interpretable causal structure in large language models while they follow instructions. We apply Boundless DAS to the Alpaca model (7B parameters), which, off the shelf, solves a simple numerical reasoning problem. With Boundless DAS, we discover that Alpaca does this by implementing a causal model with two interpretable boolean variables. Furthermore, we find that the alignment of neural representations with these variables is robust to changes in inputs and instructions. These findings mark a first step toward deeply understanding the inner workings of our largest and most widely deployed language models.
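The core operation behind an alignment search of this kind can be illustrated with a minimal sketch of an interchange intervention in a rotated basis: a hidden vector from a "base" input has part of its rotated coordinates overwritten with those from a "source" input, and the result is rotated back. This is only an illustration under stated assumptions, not the authors' implementation. The function names, the random orthogonal matrix, and the fixed boundary `k` are all hypothetical; in the method described above, the rotation and the subspace boundary would be learned by gradient descent rather than fixed by hand.

```python
import numpy as np

def random_rotation(dim, seed=0):
    """Return a random orthogonal matrix (a stand-in for a learned rotation)."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

def interchange_intervention(base, source, rotation, k):
    """Swap the first k rotated coordinates of `base` with those of `source`."""
    base_rot = rotation.T @ base      # express both hidden vectors in the rotated basis
    source_rot = rotation.T @ source
    mixed = base_rot.copy()
    mixed[:k] = source_rot[:k]        # overwrite only the candidate aligned subspace
    return rotation @ mixed           # map back to the original neuron basis

# Toy hidden vectors; in practice these would come from a model's activations.
dim, k = 8, 3                         # k is a hypothetical, hand-picked boundary
R = random_rotation(dim)
rng = np.random.default_rng(1)
base = rng.standard_normal(dim)
source = rng.standard_normal(dim)
patched = interchange_intervention(base, source, R, k)
```

In this sketch, `patched` agrees with `source` on the first `k` rotated coordinates and with `base` on the rest. An alignment search would then check whether feeding `patched` forward changes the model's output the way the corresponding swap changes the high-level causal model's output.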
