Not All Language Model Features Are Linear

Abstract

Recent work has proposed the linear representation hypothesis: that language models perform computation by manipulating one-dimensional representations of concepts ("features") in activation space. In contrast, we explore whether some language model representations may be inherently multi-dimensional. We begin by developing a rigorous definition of irreducible multi-dimensional features based on whether they can be decomposed into either independent or non-co-occurring lower-dimensional features. Motivated by these definitions, we design a scalable method that uses sparse autoencoders to automatically find multi-dimensional features in GPT-2 and Mistral 7B. These auto-discovered features include strikingly interpretable examples, e.g. circular features representing days of the week and months of the year. We identify tasks where these exact circles are used to solve computational problems involving modular arithmetic in days of the week and months of the year. Finally, we provide evidence that these circular features are indeed the fundamental unit of computation in these tasks with intervention experiments on Mistral 7B and Llama 3 8B, and we find further circular representations by breaking down the hidden states for these tasks into interpretable components.
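As a toy illustration of why a circular representation suits the modular arithmetic described above, the sketch below places the seven days of the week on a unit circle and adds an offset by rotating the point. It is not the paper's method or code: the function names (`embed`, `rotate`, `decode`, `add_days`) are hypothetical, and the evenly spaced 2-D circle is an idealized assumption rather than an extracted model feature.

```python
import math

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
N = len(DAYS)


def embed(index):
    """Place day `index` on the unit circle as a 2-D point (idealized)."""
    angle = 2 * math.pi * index / N
    return (math.cos(angle), math.sin(angle))


def rotate(point, steps):
    """Rotate a point by `steps` day-sized increments around the circle."""
    theta = 2 * math.pi * steps / N
    x, y = point
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))


def decode(point):
    """Recover the nearest day index from the point's angle."""
    angle = math.atan2(point[1], point[0])  # in (-pi, pi]
    return round(angle * N / (2 * math.pi)) % N


def add_days(day, offset):
    """Compute (day + offset) mod 7 purely by rotating on the circle."""
    start = DAYS.index(day)
    return DAYS[decode(rotate(embed(start), offset))]
```

For example, `add_days("Saturday", 3)` rotates the Saturday point three steps and decodes the result as a day name. The wrap-around at the end of the week needs no special handling because the rotation itself is periodic, which is one intuition for why a circle is a natural carrier for this kind of computation.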
