Not All Language Model Features Are Linear
May 23, 2024
Authors: Joshua Engels, Isaac Liao, Eric J. Michaud, Wes Gurnee, Max Tegmark
cs.AI
Abstract
Recent work has proposed the linear representation hypothesis: that language
models perform computation by manipulating one-dimensional representations of
concepts ("features") in activation space. In contrast, we explore whether some
language model representations may be inherently multi-dimensional. We begin by
developing a rigorous definition of irreducible multi-dimensional features
based on whether they can be decomposed into either independent or
non-co-occurring lower-dimensional features. Motivated by these definitions, we
design a scalable method that uses sparse autoencoders to automatically find
multi-dimensional features in GPT-2 and Mistral 7B. These auto-discovered
features include strikingly interpretable examples, e.g. circular features
representing days of the week and months of the year. We identify tasks where
these exact circles are used to solve computational problems involving modular
arithmetic in days of the week and months of the year. Finally, we provide
evidence that these circular features are indeed the fundamental unit of
computation in these tasks with intervention experiments on Mistral 7B and
Llama 3 8B, and we find further circular representations by breaking down the
hidden states for these tasks into interpretable components.
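
The discovery pipeline described in the abstract (train a sparse autoencoder, group its decoder directions, and look for low-dimensional structure in the spanned subspaces) could look roughly like the sketch below. This is an illustration of the idea, not the authors' code: the clustering method, function names, and array shapes are all assumptions.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.decomposition import PCA

def find_multidimensional_features(decoder_dirs, activations, n_clusters=50):
    """Illustrative sketch of SAE-based multi-dimensional feature discovery.

    Assumed shapes: decoder_dirs is [n_features, d_model] (unit-norm rows
    of a trained sparse autoencoder's decoder), activations is
    [n_tokens, d_model] of model hidden states.
    """
    # Cluster decoder directions by cosine similarity, so features that
    # jointly span a low-dimensional subspace land in the same cluster.
    sims = np.abs(decoder_dirs @ decoder_dirs.T)
    labels = SpectralClustering(
        n_clusters=n_clusters, affinity="precomputed"
    ).fit_predict(sims)

    projections = []
    for c in range(n_clusters):
        dirs = decoder_dirs[labels == c]
        if len(dirs) < 2:
            continue
        # Project hidden states onto the cluster's span, then reduce to 2D
        # and inspect the result for structure such as circles.
        proj = activations @ dirs.T
        projections.append(PCA(n_components=2).fit_transform(proj))
    return labels, projections
```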
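The circular features themselves can be illustrated with a toy model. As a minimal sketch of how a circular representation can support modular arithmetic (an illustration of the geometry, not the model's learned computation), place the seven days of the week at angles 2πk/7 on the unit circle; adding days is then a rotation.

```python
import numpy as np

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def embed(k, n=7):
    """Point for element k of an n-element cycle on the unit circle."""
    theta = 2 * np.pi * k / n
    return np.array([np.cos(theta), np.sin(theta)])

def add_mod(k, delta, n=7):
    """Modular addition k + delta (mod n), realized as a 2D rotation."""
    theta = 2 * np.pi * delta / n
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    x, y = rot @ embed(k, n)
    angle = np.arctan2(y, x) % (2 * np.pi)  # decode: nearest cycle element
    return int(round(angle * n / (2 * np.pi))) % n

print(DAYS[add_mod(DAYS.index("Fri"), 4)])  # "four days after Friday" -> Tue
```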