A Geometric Account of Activation Steering via Angle-Norm Decomposition

Abstract

Linear activation steering has gained popularity as a simple and empirically effective way to control language model behavior. More recently, spherical steering paradigms have been proposed to address limitations of additive interventions, often motivated by the assumption that hidden-state norm does not carry concept-relevant information. In this work, we revisit this assumption through a controlled empirical study designed to disentangle the roles of angular and radial components. We show that steering methods differ mainly in how they couple two geometric effects: changing a token's angular alignment with a concept direction and changing its hidden-state norm. Across seven language models, we find that concepts are represented primarily in angular structure, supporting the motivation for spherical methods, but that norm remains important for the stability and downstream effects of steering. Our results explain why interventions with similar concept-level effects can behave differently, and suggest that activation steering should be parameterized by interpretable angular and radial components of the intervention, rather than by a single additive coefficient that entangles these two effects.
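
To make the two geometric effects concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single hidden state `h` and a concept direction `v` given as plain vectors; the function names `angle_and_norm`, `additive_steer`, and `angular_radial_steer` are hypothetical and introduced only for illustration. It contrasts an additive update, which changes angle and norm together through one coefficient, with an update parameterized separately by a rotation angle and a norm scale.

```python
import numpy as np

def angle_and_norm(h, v):
    """Return (angle in radians between h and v, Euclidean norm of h)."""
    norm = np.linalg.norm(h)
    cos = np.dot(h, v) / (norm * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0))), float(norm)

def additive_steer(h, v, alpha):
    """Additive steering h + alpha * v: one coefficient moves both angle and norm."""
    return h + alpha * v

def angular_radial_steer(h, v, dtheta, scale):
    """Illustrative decoupled update (an assumption, not a method from the paper):
    rotate h toward v by dtheta radians in the plane spanned by h and v,
    then multiply the original norm by `scale`."""
    norm = np.linalg.norm(h)
    h_hat = h / norm
    v_hat = v / np.linalg.norm(v)
    # Unit vector in span{h, v} that is orthogonal to h and points toward v.
    u = v_hat - np.dot(v_hat, h_hat) * h_hat
    u_norm = np.linalg.norm(u)
    if u_norm < 1e-12:
        # h is (anti-)parallel to v, so the rotation plane is undefined.
        return scale * h
    u = u / u_norm
    theta, _ = angle_and_norm(h, v)
    dtheta = min(dtheta, theta)  # do not rotate past the concept direction
    rotated = np.cos(dtheta) * h_hat + np.sin(dtheta) * u
    return scale * norm * rotated

if __name__ == "__main__":
    h = np.array([1.0, 0.0, 0.0])
    v = np.array([0.0, 1.0, 0.0])
    # Both updates reach the same angle to v, but only the additive one changes the norm.
    print(angle_and_norm(additive_steer(h, v, 1.0), v))
    print(angle_and_norm(angular_radial_steer(h, v, np.pi / 4, 1.0), v))
```

In this toy setting the two updates produce the same angular alignment with `v` while leaving the hidden state at different norms, which is the kind of coupling the abstract argues a single additive coefficient hides.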