Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Abstract

Large language models interact with users through a simulated "Assistant" persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these ideals. In this paper, we identify directions in the model's activation space, which we call persona vectors, that underlie several traits, such as evil, sycophancy, and propensity to hallucinate. We confirm that these vectors can be used to monitor fluctuations in the Assistant's personality at deployment time. We then apply persona vectors to predict and control personality shifts that occur during training. We find that both intended and unintended personality changes after finetuning are strongly correlated with shifts along the relevant persona vectors. These shifts can be mitigated through post-hoc intervention, or avoided in the first place with a new preventative steering method. Moreover, persona vectors can be used to flag training data that will produce undesirable personality changes, at both the dataset level and the individual sample level. Our method for extracting persona vectors is automated and can be applied to any personality trait of interest, given only a natural-language description.
