Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Abstract

Large language models interact with users through a simulated 'Assistant' persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these ideals. In this paper, we identify directions in the model's activation space, which we call persona vectors, that underlie several traits, such as evil, sycophancy, and propensity to hallucinate. We confirm that these vectors can be used to monitor fluctuations in the Assistant's personality at deployment time. We then apply persona vectors to predict and control personality shifts that occur during training. We find that both intended and unintended personality changes after finetuning are strongly correlated with shifts along the relevant persona vectors. These shifts can be mitigated through post-hoc intervention, or avoided in the first place with a new preventative steering method. Moreover, persona vectors can be used to flag training data that will produce undesirable personality changes, at both the dataset level and the individual sample level. Our method for extracting persona vectors is automated and can be applied to any personality trait of interest, given only a natural-language description.
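To make the ideas of a direction in activation space, monitoring, and post-hoc intervention concrete, the sketch below works through them on toy numbers. It is an illustration under stated assumptions, not the paper's pipeline: the abstract does not say how persona vectors are computed, so the sketch assumes a simple difference of mean activations between trait-exhibiting and baseline responses. Monitoring is modeled as a scalar projection onto that direction, and intervention as adding a scaled copy of the direction. All function names and activation values are hypothetical.

```python
# Illustrative sketch only: toy activations, not real model internals.
from statistics import fmean


def mean_vector(rows):
    """Average a list of equal-length activation vectors column by column."""
    return [fmean(col) for col in zip(*rows)]


def persona_vector(trait_acts, baseline_acts):
    """Assumed extraction rule: mean trait activation minus mean baseline activation."""
    trait_mean = mean_vector(trait_acts)
    base_mean = mean_vector(baseline_acts)
    return [t - b for t, b in zip(trait_mean, base_mean)]


def projection(activation, direction):
    """Scalar projection of an activation onto the unit-normalized direction."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum(a * d for a, d in zip(activation, direction)) / norm


def steer(activation, direction, alpha):
    """Shift an activation by alpha times the direction (negative alpha suppresses)."""
    return [a + alpha * d for a, d in zip(activation, direction)]


# Toy 3-dimensional activations (hypothetical values).
trait_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
baseline_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

v = persona_vector(trait_acts, baseline_acts)      # direction associated with the trait
score = projection([1.5, 2.0, 1.0], v)             # monitoring signal for one activation
steered = steer([1.5, 2.0, 1.0], v, alpha=-0.5)    # post-hoc shift away from the trait
```

In this toy setup a larger `score` means the activation lies further along the assumed trait direction, and a negative `alpha` moves it back toward the baseline. In a real model the activations would come from a chosen layer, and both the layer and the steering coefficient would need to be tuned empirically.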
