Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Abstract

Large language models interact with users through a simulated "Assistant" persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these ideals. In this paper, we identify directions in the model's activation space, called persona vectors, that underlie several traits, such as evil, sycophancy, and propensity to hallucinate. We confirm that these vectors can be used to monitor fluctuations in the Assistant's personality at deployment time. We then apply persona vectors to predict and control personality shifts that occur during training. We find that both intended and unintended personality changes after finetuning are strongly correlated with shifts along the relevant persona vectors. These shifts can be mitigated through post-hoc intervention, or avoided in the first place with a new preventative steering method. Moreover, persona vectors can be used to flag training data that will produce undesirable personality changes, both at the dataset level and the individual sample level. Our method for extracting persona vectors is automated and can be applied to any personality trait of interest, given only a natural-language description.
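To make the idea of a direction in activation space concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's pipeline: the abstract does not say how the vectors are extracted, so the sketch assumes a difference of mean activations between trait-exhibiting and baseline responses. The function names (`persona_direction`, `projection`, `steer`) and the three-dimensional toy activations are hypothetical. Monitoring is shown as a projection onto the direction, and steering as a shift along it.

```python
import math

Vector = list[float]


def mean_vector(rows: list[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def persona_direction(trait_acts: list[Vector], baseline_acts: list[Vector]) -> Vector:
    """Unit vector pointing from the baseline mean to the trait mean.

    Assumption: a difference-of-means direction stands in for the paper's
    extraction method, which the abstract does not specify.
    """
    diff = [t - b for t, b in zip(mean_vector(trait_acts), mean_vector(baseline_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("trait and baseline means coincide; no direction to extract")
    return [x / norm for x in diff]


def projection(activation: Vector, direction: Vector) -> float:
    """Monitoring signal: scalar projection of an activation onto a unit direction."""
    return sum(a * d for a, d in zip(activation, direction))


def steer(activation: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift an activation by alpha along the direction (negative alpha moves away)."""
    return [a + alpha * d for a, d in zip(activation, direction)]


# Toy activations (hypothetical): the trait only changes the first coordinate.
trait_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
baseline_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

v = persona_direction(trait_acts, baseline_acts)

# Monitor a new activation, then steer it away from the trait direction.
new_act = [5.0, 2.0, 1.0]
score = projection(new_act, v)
steered = steer(new_act, v, alpha=-score)
```

In a real model the activations would come from a chosen layer's hidden states rather than hand-written lists, and the steering coefficient would be tuned rather than set to cancel the projection exactly. Both choices here are simplifications for the toy example.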
