Sparse Autoencoders Find Highly Interpretable Features in Language Models

Abstract

One of the roadblocks to a better understanding of neural networks' internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesised cause of polysemanticity is superposition, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identify those directions, using sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is measured by automated methods. Ablating these features enables precise model editing, for example, by removing capabilities such as pronoun prediction, while disrupting model behaviour less than prior techniques. This work indicates that it is possible to resolve superposition in language models using a scalable, unsupervised method. Our method may serve as a foundation for future mechanistic interpretability work, which we hope will enable greater model transparency and steerability.
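As a rough illustration of the idea in the abstract, the sketch below shows one way a sparse autoencoder over activation vectors could look, together with a simple feature ablation. The abstract does not give the architecture, so everything specific here is an assumption made for illustration: a single hidden layer with tied encoder and decoder weights, a ReLU nonlinearity, an L1 sparsity penalty, unit-norm dictionary rows, and the hypothetical names `init_dictionary`, `encode`, `decode`, `sae_loss`, and `ablate_feature`. Random vectors stand in for real language model activations, and no training loop is included.

```python
import numpy as np


def init_dictionary(n_features, d_model, seed=0):
    """Random overcomplete dictionary (n_features > d_model) with unit-norm rows.

    Unit-norm rows are an assumption of this sketch, not something stated in the abstract.
    """
    rng = np.random.default_rng(seed)
    M = rng.normal(size=(n_features, d_model))
    return M / np.linalg.norm(M, axis=1, keepdims=True)


def encode(x, M, b):
    """Map activations x (batch, d_model) to non-negative feature coefficients."""
    return np.maximum(x @ M.T + b, 0.0)


def decode(c, M):
    """Reconstruct activations from coefficients using the same (tied) dictionary."""
    return c @ M


def sae_loss(x, M, b, l1_coeff):
    """Reconstruction error plus an L1 penalty that encourages sparse coefficients."""
    c = encode(x, M, b)
    x_hat = decode(c, M)
    reconstruction = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(c), axis=1))
    return reconstruction + l1_coeff * sparsity


def ablate_feature(x, M, b, idx):
    """Subtract one feature's contribution (coefficient times direction) from x."""
    c = encode(x, M, b)
    return x - np.outer(c[:, idx], M[idx])


if __name__ == "__main__":
    # Stand-in data: 8 activation vectors of width 16, dictionary of 64 directions.
    x = np.random.default_rng(1).normal(size=(8, 16))
    M = init_dictionary(n_features=64, d_model=16)
    b = np.zeros(64)
    print("loss:", sae_loss(x, M, b, l1_coeff=1e-3))
    print("ablated shape:", ablate_feature(x, M, b, idx=3).shape)
```

In this sketch the dictionary has four times as many directions as the activation space has dimensions, which mirrors the "overcomplete set of directions" mentioned above. In practice the dictionary and bias would be fitted by minimising `sae_loss` on activations collected from a model, and the ablated activations would be written back into the forward pass to measure the change in behaviour.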
