Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Abstract

Large language models interact with users through a simulated 'Assistant' persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these ideals. In this paper, we identify directions in the model's activation space, called persona vectors, that underlie several traits, such as evil, sycophancy, and propensity to hallucinate. We confirm that these vectors can be used to monitor fluctuations in the Assistant's personality at deployment time. We then apply persona vectors to predict and control personality shifts that occur during training. We find that both intended and unintended personality changes after finetuning are strongly correlated with shifts along the relevant persona vectors. These shifts can be mitigated through post-hoc intervention, or avoided in the first place with a new preventative steering method. Moreover, persona vectors can be used to flag training data that will produce undesirable personality changes, both at the dataset level and the individual sample level. Our method for extracting persona vectors is automated and can be applied to any personality trait of interest, given only a natural-language description.
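To make the idea of a direction in activation space concrete, here is a minimal sketch on toy three-dimensional vectors. The abstract does not say how the vectors are extracted, so the difference-of-means construction below is an assumption chosen for illustration. The function names (`persona_direction`, `projection`, `steer`) and the toy data are likewise hypothetical and are not the authors' implementation.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def persona_direction(trait_acts: Sequence[Sequence[float]],
                      baseline_acts: Sequence[Sequence[float]]) -> Vector:
    """Unit vector pointing from the baseline mean to the trait mean.

    Assumption: a difference of mean activations stands in for the
    extraction procedure, which the abstract does not describe.
    """
    mu_trait = mean_vector(trait_acts)
    mu_base = mean_vector(baseline_acts)
    diff = [t - b for t, b in zip(mu_trait, mu_base)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("trait and baseline means coincide; no direction")
    return [d / norm for d in diff]


def projection(activation: Sequence[float], direction: Sequence[float]) -> float:
    """Scalar projection of an activation onto a unit direction (monitoring)."""
    return sum(a * d for a, d in zip(activation, direction))


def steer(activation: Sequence[float], direction: Sequence[float],
          alpha: float) -> Vector:
    """Shift an activation by alpha along the direction (a simple intervention)."""
    return [a + alpha * d for a, d in zip(activation, direction)]


# Toy activations: the "trait" samples differ from baseline only in coordinate 0.
trait_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
baseline_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

direction = persona_direction(trait_acts, baseline_acts)
observed = [0.5, 2.0, -1.0]
score = projection(observed, direction)       # how far along the trait direction
steered = steer(observed, direction, -score)  # remove that component
```

In this sketch, monitoring amounts to tracking `score` over time, and steering amounts to adding or subtracting a multiple of `direction`. The paper's actual monitoring and steering procedures may differ.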
