Not All Language Model Features Are Linear
May 23, 2024
Authors: Joshua Engels, Isaac Liao, Eric J. Michaud, Wes Gurnee, Max Tegmark
cs.AI
Abstract
Recent work has proposed the linear representation hypothesis: that language
models perform computation by manipulating one-dimensional representations of
concepts ("features") in activation space. In contrast, we explore whether some
language model representations may be inherently multi-dimensional. We begin by
developing a rigorous definition of irreducible multi-dimensional features
based on whether they can be decomposed into either independent or
non-co-occurring lower-dimensional features. Motivated by these definitions, we
design a scalable method that uses sparse autoencoders to automatically find
multi-dimensional features in GPT-2 and Mistral 7B. These auto-discovered
features include strikingly interpretable examples, e.g. circular features
representing days of the week and months of the year. We identify tasks where
these exact circles are used to solve computational problems involving modular
arithmetic in days of the week and months of the year. Finally, we provide
evidence that these circular features are indeed the fundamental unit of
computation in these tasks with intervention experiments on Mistral 7B and
Llama 3 8B, and we find further circular representations by breaking down the
hidden states for these tasks into interpretable components.
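As a rough illustration of what a "circular feature" means here, the following minimal Python sketch (a hypothetical toy model, not the authors' code or the actual learned representations) embeds the seven days of the week at equally spaced points on a circle in a two-dimensional subspace and computes day-of-week addition modulo 7 as a rotation in that plane. In the paper, analogous circles are discovered inside model activations via sparse autoencoders and validated with intervention experiments.

```python
import numpy as np

# Toy circular feature: each day of the week is a point at an equally
# spaced angle on the unit circle in a 2-D subspace.
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
angles = 2 * np.pi * np.arange(7) / 7
circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (7, 2)

def add_days(day_idx: int, offset: int) -> int:
    """Compute (day_idx + offset) mod 7 by rotating the day's point on the circle."""
    theta = 2 * np.pi * offset / 7
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    rotated = rot @ circle[day_idx]
    # Decode back to a day by nearest circle point (largest dot product,
    # since all points are unit norm).
    return int(np.argmax(circle @ rotated))

# Friday + 4 days = Tuesday, via rotation rather than integer arithmetic.
assert add_days(days.index("Fri"), 4) == days.index("Tue")
```

The point of the sketch is that a representation like this cannot be reduced to a single one-dimensional direction: both coordinates of the plane are needed jointly, which is the sense in which the paper calls such features irreducibly multi-dimensional.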