

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

January 15, 2026
Authors: Christina Lu, Jack Gallagher, Jonathan Michala, Kyle Fish, Jack Lindsey
cs.AI

Abstract

Large language models can represent a variety of personas but typically default to a helpful Assistant identity cultivated during post-training. We investigate the structure of the space of model personas by extracting activation directions corresponding to diverse character archetypes. Across several different models, we find that the leading component of this persona space is an "Assistant Axis," which captures the extent to which a model is operating in its default Assistant mode. Steering towards the Assistant direction reinforces helpful and harmless behavior; steering away increases the model's tendency to identify as other entities. Moreover, steering away with more extreme values often induces a mystical, theatrical speaking style. We find this axis is also present in pre-trained models, where it primarily promotes helpful human archetypes like consultants and coaches and inhibits spiritual ones. Measuring deviations along the Assistant Axis predicts "persona drift," a phenomenon where models slip into exhibiting harmful or bizarre behaviors that are uncharacteristic of their typical persona. We find that persona drift is often driven by conversations demanding meta-reflection on the model's processes or featuring emotionally vulnerable users. We show that restricting activations to a fixed region along the Assistant Axis can stabilize model behavior in these scenarios -- and also in the face of adversarial persona-based jailbreaks. Our results suggest that post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona.
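The pipeline the abstract describes can be illustrated with a toy numerical sketch. This is a hypothetical reconstruction, not the paper's actual code: the dimensions, persona vectors, and function names (`steer`, `clamp_to_band`) are assumptions. It shows the three core ideas in miniature: taking the leading principal component of a set of persona activation directions as the "Assistant Axis," steering by adding a multiple of that axis to an activation, and stabilizing behavior by clipping an activation's projection onto the axis to a fixed interval.

```python
import numpy as np

# Toy stand-in for the paper's setup (all sizes and data are assumptions):
# each persona is summarized by a mean activation vector, and the
# "Assistant Axis" is taken as the leading principal component of the
# mean-centered persona vectors.
rng = np.random.default_rng(0)
d_model = 64        # hidden dimension (toy size)
n_personas = 20

# Synthetic persona means: a shared dominant direction plus noise,
# standing in for activations gathered from persona-eliciting prompts.
shared = rng.normal(size=d_model)
shared /= np.linalg.norm(shared)
personas = (rng.normal(size=(n_personas, 1)) * shared
            + 0.1 * rng.normal(size=(n_personas, d_model)))

# Leading component of persona space via SVD of the centered matrix.
centered = personas - personas.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
assistant_axis = vt[0]           # unit-norm leading direction

def steer(activation, alpha):
    """Add alpha times the axis to an activation; positive alpha
    pushes toward the Assistant direction, negative away from it."""
    return activation + alpha * assistant_axis

def clamp_to_band(activation, lo, hi):
    """Restrict the activation's projection onto the axis to [lo, hi],
    leaving the component orthogonal to the axis untouched."""
    proj = activation @ assistant_axis
    return activation + (np.clip(proj, lo, hi) - proj) * assistant_axis

act = rng.normal(size=d_model)
steered = steer(act, 2.0)        # projection shifts by exactly +2.0
clamped = clamp_to_band(act, -1.0, 1.0)
```

In a real model the same operations would be applied to residual-stream activations via forward hooks during generation; clamping rather than pure steering is what the abstract describes as "restricting activations to a fixed region along the Assistant Axis."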