Sparse Autoencoders Find Highly Interpretable Features in Language Models

Abstract

One of the roadblocks to a better understanding of neural networks' internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesised cause of polysemanticity is superposition, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identify those directions, using sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is measured by automated methods. Ablating these features enables precise model editing, for example, by removing capabilities such as pronoun prediction, while disrupting model behaviour less than prior techniques. This work indicates that it is possible to resolve superposition in language models using a scalable, unsupervised method. Our method may serve as a foundation for future mechanistic interpretability work, which we hope will enable greater model transparency and steerability.
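
As a rough illustration of the idea described above, the following is a minimal sketch of a sparse autoencoder applied to a batch of activation vectors, together with a single-feature ablation. The abstract does not specify the architecture, so everything here is an assumption made for illustration: the one-hidden-layer design, the ReLU encoder, the L1 sparsity penalty, the function names (`sae_forward`, `sae_loss`, `ablate_feature`), the dimensions, and the random stand-in data. It should not be read as the authors' implementation.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into non-negative feature coefficients, then reconstruct."""
    f = np.maximum(0.0, x @ W_enc + b_enc)   # feature activations, shape (batch, n_features)
    x_hat = f @ W_dec + b_dec                # reconstruction, shape (batch, d_model)
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff):
    """Reconstruction error plus an L1 penalty that encourages sparse features."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(f), axis=1))
    return recon + l1_coeff * sparsity

def ablate_feature(x, f, W_dec, idx):
    """Subtract one feature's decoder direction, scaled by its activation, from x."""
    return x - np.outer(f[:, idx], W_dec[idx])

# Hypothetical sizes: more features than dimensions (an overcomplete dictionary).
rng = np.random.default_rng(0)
d_model, n_features, batch = 8, 32, 4

x = rng.normal(size=(batch, d_model))        # stand-in for language model activations
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
b_dec = np.zeros(d_model)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat, l1_coeff=1e-3)
x_edited = ablate_feature(x, f, W_dec, idx=0)
```

In this sketch the weights are random and untrained. In practice they would be fitted by minimising the loss over a large sample of activations, after which each row of `W_dec` is a candidate feature direction in activation space.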
