A Geometric Interpretation of Activation Steering via Angle-Norm Decomposition

Abstract

Linear activation steering has gained popularity as a simple and empirically effective way to control language model behavior. More recently, spherical steering paradigms have been proposed to address limitations of additive interventions, often motivated by the assumption that hidden-state norm does not carry concept-relevant information. In this work, we revisit this assumption through a controlled empirical study designed to disentangle the roles of angular and radial components. We show that steering methods differ mainly in how they couple two geometric effects: changing a token's angular alignment with a concept direction and changing its hidden-state norm. Across seven language models, we find that concepts are represented primarily in angular structure, supporting the motivation for spherical methods, but that norm remains important for the stability and downstream effects of steering. Our results explain why interventions with similar concept-level effects can behave differently, and suggest that activation steering should be parameterized by interpretable angular and radial components of the intervention, rather than by a single additive coefficient that entangles these two effects.
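
To make the coupling described above concrete, the following is a minimal sketch in pure Python. It is not the paper's implementation: the function names (`additive_steer`, `angular_radial_steer`) and the parameters `alpha`, `dtheta`, and `scale` are hypothetical and chosen only for illustration. It contrasts an additive update, which changes the angle and the norm together through one coefficient, with an update that rotates the hidden state toward the concept direction by a chosen angle and rescales its norm separately.

```python
import math


def norm(x):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(a * a for a in x))


def angle_to(h, v):
    """Angle in radians between h and v."""
    dot = sum(a * b for a, b in zip(h, v))
    cos = dot / (norm(h) * norm(v))
    return math.acos(max(-1.0, min(1.0, cos)))


def additive_steer(h, v, alpha):
    """Additive steering: h + alpha * v (changes angle and norm jointly)."""
    return [a + alpha * b for a, b in zip(h, v)]


def angular_radial_steer(h, v, dtheta, scale):
    """Hypothetical decoupled update (an assumption, not the paper's method).

    Rotates h toward v by at most dtheta radians within the plane spanned
    by h and v, then multiplies the original norm of h by scale.
    """
    r = norm(h)
    u = [a / r for a in h]                      # unit vector along h
    v_norm = norm(v)
    v_hat = [b / v_norm for b in v]             # unit concept direction
    c = sum(a * b for a, b in zip(u, v_hat))    # cosine of current angle
    w = [b - c * a for a, b in zip(u, v_hat)]   # part of v_hat orthogonal to u
    w_norm = norm(w)
    if w_norm < 1e-12:
        # h is (anti)parallel to v: the rotation plane is undefined,
        # so only the radial part is applied.
        return [scale * a for a in h]
    w = [x / w_norm for x in w]
    theta = math.acos(max(-1.0, min(1.0, c)))
    step = min(dtheta, theta)                   # do not rotate past v
    new_u = [math.cos(step) * a + math.sin(step) * b for a, b in zip(u, w)]
    return [scale * r * x for x in new_u]


# Toy example: h is orthogonal to the concept direction v.
h = [3.0, 0.0, 4.0]
v = [0.0, 1.0, 0.0]

added = additive_steer(h, v, alpha=5.0)
rotated = angular_radial_steer(h, v, dtheta=math.pi / 4, scale=1.0)

print("additive: angle", angle_to(added, v), "norm", norm(added))
print("decoupled: angle", angle_to(rotated, v), "norm", norm(rotated))
```

In this toy setting both updates reach the same angle to `v`, but only the additive one also increases the norm. This is the kind of entanglement that a single additive coefficient cannot express separately.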