Not All Language Model Features Are Linear

Abstract

Recent work has proposed the linear representation hypothesis: that language models perform computation by manipulating one-dimensional representations of concepts ("features") in activation space. In contrast, we explore whether some language model representations may be inherently multi-dimensional. We begin by developing a rigorous definition of irreducible multi-dimensional features based on whether they can be decomposed into either independent or non-co-occurring lower-dimensional features. Motivated by these definitions, we design a scalable method that uses sparse autoencoders to automatically find multi-dimensional features in GPT-2 and Mistral 7B. These auto-discovered features include strikingly interpretable examples, such as circular features representing days of the week and months of the year. We identify tasks where these exact circles are used to solve computational problems involving modular arithmetic in days of the week and months of the year. Finally, we provide evidence that these circular features are indeed the fundamental unit of computation in these tasks with intervention experiments on Mistral 7B and Llama 3 8B, and we find further circular representations by breaking down the hidden states for these tasks into interpretable components.
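
As a toy illustration of why a circular feature suits modular arithmetic, the sketch below places the seven days of the week on a unit circle and performs "day plus offset" as a rotation. This is only an idealized picture under stated assumptions: it is not the paper's method, it does not use any model activations, and the helper names (`embed`, `rotate`, `decode`, `add_days`) are hypothetical.

```python
import math

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def embed(index, period):
    """Place a cyclic index on the unit circle as a 2-D point."""
    angle = 2 * math.pi * index / period
    return (math.cos(angle), math.sin(angle))


def rotate(point, steps, period):
    """Rotate a 2-D point by `steps` positions around the circle."""
    angle = 2 * math.pi * steps / period
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


def decode(point, period):
    """Return the cyclic index whose embedding is nearest to `point`."""
    angle = math.atan2(point[1], point[0])
    return round(angle * period / (2 * math.pi)) % period


def add_days(day, offset):
    """Compute `day + offset` (mod 7) purely by rotating on the circle."""
    period = len(DAYS)
    start = embed(DAYS.index(day), period)
    return DAYS[decode(rotate(start, offset, period), period)]


if __name__ == "__main__":
    print(add_days("Monday", 4))
    print(add_days("Saturday", 3))
```

The point of the sketch is that the wrap-around (Saturday plus three days landing on Tuesday) needs no special case: it falls out of the geometry. A single one-dimensional direction cannot encode this cycle as directly, which is the intuition behind treating the circle as a two-dimensional unit. The same functions work for months by using a period of 12.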
