A Geometric Interpretation of Activation Steering via Angle-Norm Decomposition

Abstract

Linear activation steering has gained popularity as a simple and empirically effective way to control language model behavior. More recently, spherical steering paradigms have been proposed to address limitations of additive interventions, often motivated by the assumption that hidden-state norm does not carry concept-relevant information. In this work, we revisit this assumption through a controlled empirical study designed to disentangle the roles of angular and radial components. We show that steering methods differ mainly in how they couple two geometric effects: changing a token's angular alignment with a concept direction and changing its hidden-state norm. Across seven language models, we find that concepts are represented primarily in angular structure, supporting the motivation for spherical methods, but that norm remains important for the stability and downstream effects of steering. Our results explain why interventions with similar concept-level effects can behave differently, and suggest that activation steering should be parameterized by interpretable angular and radial components of the intervention, rather than by a single additive coefficient that entangles these two effects.
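The coupling described above can be illustrated with a small numerical sketch. It is not taken from the paper: the function names (`additive_steer`, `angular_radial_steer`, `decompose`) are hypothetical, and rotating within the plane spanned by the hidden state and the concept direction is only one plausible way to realize a norm-preserving (spherical) intervention. Under those assumptions, the sketch shows that a single additive coefficient changes both the angle to the concept direction and the norm, and that the same update can be re-expressed with a separate angular parameter and radial parameter.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def additive_steer(h, v, alpha):
    """Additive steering: h' = h + alpha * v_hat. Changes angle AND norm."""
    return h + alpha * v / np.linalg.norm(v)

def angular_radial_steer(h, v, theta, scale=1.0):
    """Rotate h toward v by theta radians in span(h, v), then rescale the norm.

    Hypothetical parameterization: theta is the angular component,
    scale is the radial component (scale=1.0 keeps the norm unchanged).
    """
    r = np.linalg.norm(h)
    h_hat = h / r
    v_hat = v / np.linalg.norm(v)
    # Unit vector in span(h, v) that is orthogonal to h.
    u = v_hat - (v_hat @ h_hat) * h_hat
    u_norm = np.linalg.norm(u)
    if u_norm < 1e-12:
        # h is (anti)parallel to v, so the rotation plane is undefined.
        return scale * h
    u = u / u_norm
    # Do not rotate past the concept direction itself.
    theta = min(theta, np.arccos(np.clip(v_hat @ h_hat, -1.0, 1.0)))
    rotated = np.cos(theta) * h_hat + np.sin(theta) * u
    return scale * r * rotated

def decompose(h, h_new, v):
    """Return (reduction in angle to v in radians, norm ratio) of an intervention."""
    def angle(x):
        return np.arccos(np.clip(cosine(x, v), -1.0, 1.0))
    return angle(h) - angle(h_new), np.linalg.norm(h_new) / np.linalg.norm(h)

# Toy hidden state and concept direction (illustrative values only).
h = np.array([3.0, 0.0, 4.0])
v = np.array([0.0, 1.0, 0.0])

h_add = additive_steer(h, v, alpha=5.0)
d_theta, norm_ratio = decompose(h, h_add, v)
print("additive: angle change =", d_theta, "norm ratio =", norm_ratio)

# Same angular effect, norm held fixed (a spherical-style intervention).
h_rot = angular_radial_steer(h, v, d_theta)
print("rotation only: norm ratio =", np.linalg.norm(h_rot) / np.linalg.norm(h))

# Both components together reproduce the additive update for alpha > 0.
print("matches additive:", np.allclose(angular_radial_steer(h, v, d_theta, norm_ratio), h_add))
```

In this toy setting the additive update and the rotation-only update reach the same cosine similarity with the concept direction but differ in norm, which is the kind of gap between concept-level and downstream effects that the abstract attributes to the radial component.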