A Geometric Account of Activation Steering via Angle-Norm Decomposition

Abstract

Linear activation steering has gained popularity as a simple and empirically effective way to control language model behavior. More recently, spherical steering paradigms have been proposed to address limitations of additive interventions, often motivated by the assumption that hidden-state norm does not carry concept-relevant information. In this work, we revisit this assumption through a controlled empirical study designed to disentangle the roles of angular and radial components. We show that steering methods differ mainly in how they couple two geometric effects: changing a token's angular alignment with a concept direction and changing its hidden-state norm. Across seven language models, we find that concepts are represented primarily in angular structure, supporting the motivation for spherical methods, but that norm remains important for the stability and downstream effects of steering. Our results explain why interventions with similar concept-level effects can behave differently, and suggest that activation steering should be parameterized by interpretable angular and radial components of the intervention, rather than by a single additive coefficient that entangles these two effects.
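
To make the entanglement concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes a single hidden-state vector `h` and a concept direction `v`, and compares a standard additive update with a hypothetical update that sets the angular change and the norm scale separately. The function names (`angle_and_norm`, `additive_steer`, `angular_radial_steer`) and the parameters `delta_theta` and `scale` are assumptions introduced here for illustration only.

```python
import numpy as np

def angle_and_norm(h, v):
    """Return (angle in radians between h and v, Euclidean norm of h)."""
    norm = np.linalg.norm(h)
    cos = np.dot(h, v) / (norm * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0))), float(norm)

def additive_steer(h, v, alpha):
    """Additive steering: add alpha times the unit concept direction.

    A single coefficient alpha changes both the angle to v and the norm of h.
    """
    v_hat = v / np.linalg.norm(v)
    return h + alpha * v_hat

def angular_radial_steer(h, v, delta_theta, scale=1.0):
    """Hypothetical decoupled update (an assumption, not the paper's method).

    Rotates h toward v by delta_theta radians inside the plane spanned by
    h and v, then multiplies the original norm by `scale`.
    """
    v_hat = v / np.linalg.norm(v)
    r = np.linalg.norm(h)
    h_hat = h / r
    # Unit vector orthogonal to h_hat that points toward v_hat.
    u = v_hat - np.dot(v_hat, h_hat) * h_hat
    u_norm = np.linalg.norm(u)
    if u_norm < 1e-12:
        # h is (anti)parallel to v, so the rotation plane is undefined;
        # only the radial part of the update is applied.
        return scale * h
    u = u / u_norm
    theta = np.arccos(np.clip(np.dot(h_hat, v_hat), -1.0, 1.0))
    delta = min(delta_theta, theta)  # do not rotate past the concept direction
    rotated = np.cos(delta) * h_hat + np.sin(delta) * u
    return scale * r * rotated

if __name__ == "__main__":
    h = np.array([2.0, 0.0, 0.0])
    v = np.array([0.0, 1.0, 0.0])
    print("original       ", angle_and_norm(h, v))
    print("additive       ", angle_and_norm(additive_steer(h, v, 2.0), v))
    print("angular only   ", angle_and_norm(angular_radial_steer(h, v, np.pi / 4), v))
```

In this toy setting both updates reach the same angle to `v`, but only the additive one also changes the norm. That is one way to picture how interventions with similar concept-level effects can still differ downstream.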