Not All Language Model Features Are Linear
May 23, 2024
Authors: Joshua Engels, Isaac Liao, Eric J. Michaud, Wes Gurnee, Max Tegmark
cs.AI
Abstract
Recent work has proposed the linear representation hypothesis: that language
models perform computation by manipulating one-dimensional representations of
concepts ("features") in activation space. In contrast, we explore whether some
language model representations may be inherently multi-dimensional. We begin by
developing a rigorous definition of irreducible multi-dimensional features
based on whether they can be decomposed into either independent or
non-co-occurring lower-dimensional features. Motivated by these definitions, we
design a scalable method that uses sparse autoencoders to automatically find
multi-dimensional features in GPT-2 and Mistral 7B. These auto-discovered
features include strikingly interpretable examples, e.g. circular features
representing days of the week and months of the year. We identify tasks where
these exact circles are used to solve computational problems involving modular
arithmetic in days of the week and months of the year. Finally, we provide
evidence that these circular features are indeed the fundamental unit of
computation in these tasks with intervention experiments on Mistral 7B and
Llama 3 8B, and we find further circular representations by breaking down the
hidden states for these tasks into interpretable components.
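
The discovery pipeline described in the abstract (train a sparse autoencoder, group its decoder directions, and look for low-dimensional structure in the spanned subspaces) could look roughly like the sketch below. This is an illustration of the idea, not the authors' code: the clustering method, function names, and array shapes are all assumptions.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.decomposition import PCA

def find_multidimensional_features(decoder_dirs, activations, n_clusters=50):
    """Illustrative sketch of SAE-based multi-dimensional feature discovery.

    Assumed shapes: decoder_dirs is [n_features, d_model] (unit-norm rows
    of a trained sparse autoencoder's decoder), activations is
    [n_tokens, d_model] of model hidden states.
    """
    # Cluster decoder directions by cosine similarity, so features that
    # jointly span a low-dimensional subspace land in the same cluster.
    sims = np.abs(decoder_dirs @ decoder_dirs.T)
    labels = SpectralClustering(
        n_clusters=n_clusters, affinity="precomputed"
    ).fit_predict(sims)

    projections = []
    for c in range(n_clusters):
        dirs = decoder_dirs[labels == c]
        if len(dirs) < 2:
            continue
        # Project hidden states onto the cluster's span, then reduce to 2D
        # and inspect the result for structure such as circles.
        proj = activations @ dirs.T
        projections.append(PCA(n_components=2).fit_transform(proj))
    return labels, projections
```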
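The circular features themselves can be illustrated with a toy model. As a minimal sketch of how a circular representation can support modular arithmetic (an illustration of the geometry, not the model's learned computation), place the seven days of the week at angles 2πk/7 on the unit circle; adding days is then a rotation.

```python
import numpy as np

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def embed(k, n=7):
    """Point for element k of an n-element cycle on the unit circle."""
    theta = 2 * np.pi * k / n
    return np.array([np.cos(theta), np.sin(theta)])

def add_mod(k, delta, n=7):
    """Modular addition k + delta (mod n), realized as a 2D rotation."""
    theta = 2 * np.pi * delta / n
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    x, y = rot @ embed(k, n)
    angle = np.arctan2(y, x) % (2 * np.pi)  # decode: nearest cycle element
    return int(round(angle * n / (2 * np.pi))) % n

print(DAYS[add_mod(DAYS.index("Fri"), 4)])  # "four days after Friday" -> Tue
```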