From Directions to Regions: Decomposing Activations in Language Models via Local Geometry
February 2, 2026
Authors: Or Shafran, Shaked Ronen, Omri Fahn, Shauli Ravfogel, Atticus Geiger, Mor Geva
cs.AI
Abstract
Activation decomposition methods in language models are tightly coupled to geometric assumptions about how concepts are realized in activation space. Existing approaches search for individual global directions, implicitly assuming linear separability, which overlooks concepts with nonlinear or multi-dimensional structure. In this work, we leverage Mixture of Factor Analyzers (MFA) as a scalable, unsupervised alternative that models the activation space as a collection of Gaussian regions, each with its own local covariance structure. MFA decomposes activations into two compositional geometric objects: the region's centroid in activation space, and the local variation from the centroid. We train large-scale MFAs for Llama-3.1-8B and Gemma-2-2B, and show they capture complex, nonlinear structures in activation space. Moreover, evaluations on localization and steering benchmarks show that MFA outperforms unsupervised baselines, is competitive with supervised localization methods, and often achieves stronger steering performance than sparse autoencoders. Together, our findings position local geometry, expressed through subspaces, as a promising unit of analysis for scalable concept discovery and model control, accounting for complex structures that isolated directions fail to capture.
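The decomposition the abstract describes follows the standard factor-analyzer model, in which an activation x assigned to region k is split as x ≈ μ_k + W_k z + ε: the region centroid μ_k, a low-rank local variation W_k z, and residual noise. A minimal numpy sketch of this split (the toy dimensions, random parameters, and hard nearest-centroid assignment are illustrative assumptions, not the paper's implementation, which fits the mixture from data and uses soft responsibilities):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_regions, r = 16, 4, 3  # activation dim, number of regions, factors per region (toy sizes)

# Toy MFA parameters: per-region centroid mu_k and factor loadings W_k.
mu = rng.normal(size=(n_regions, d))
W = rng.normal(size=(n_regions, d, r))

def decompose(x):
    """Assign x to the nearest region, then split it into the region
    centroid plus the local variation within that region's subspace."""
    k = int(np.argmin(np.linalg.norm(mu - x, axis=1)))    # hard assignment (simplification)
    z, *_ = np.linalg.lstsq(W[k], x - mu[k], rcond=None)  # local factor coordinates
    local = W[k] @ z                                      # variation captured by the subspace
    residual = x - mu[k] - local                          # leftover noise term
    return k, mu[k], local, residual

x = rng.normal(size=d)
k, centroid, local, residual = decompose(x)
# By construction, centroid + local variation + residual recovers x exactly.
assert np.allclose(centroid + local + residual, x)
```

The two returned geometric objects mirror the abstract's "region centroid" and "local variation from the centroid"; steering or localization methods built on MFA can then operate on either object independently.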