Personas as a Way to Model Truthfulness in Language Models

Abstract

Large Language Models (LLMs) are trained on vast amounts of text from the internet, which contains both factual and misleading information about the world. Can language models discern truth from falsehood in this contradictory data? Expanding on the view that LLMs can model different agents producing the corpora, we hypothesize that they can cluster truthful text by modeling a truthful persona: a group of agents that are likely to produce truthful text and share similar features. For example, trustworthy sources like Wikipedia and Science usually use formal writing styles and make consistent claims. By modeling this persona, LLMs can generalize truthfulness beyond the specific contexts in which each agent generated the training text. For example, the model can infer that the agent "Wikipedia" will behave truthfully on topics that were only generated by "Science" because they share a persona. We first show evidence for the persona hypothesis via two observations: (1) we can probe whether a model's answer will be truthful before it is generated; (2) finetuning a model on a set of facts improves its truthfulness on unseen topics. Next, using arithmetic as a synthetic environment, we show that language models can separate true and false statements and generalize truthfulness across agents, but only if agents in the training data share a truthful generative process that enables the creation of a truthful persona. Overall, our findings suggest that models can exploit hierarchical structures in the data to learn abstract concepts like truthfulness.
