Personas as a Way to Model Truthfulness in Language Models

Abstract

Large language models (LLMs) are trained on vast amounts of text from the internet, which contains both factual and misleading information about the world. Can language models discern truth from falsehood in this contradictory data? Expanding on the view that LLMs can model the different agents producing the corpora, we hypothesize that they can cluster truthful text by modeling a truthful persona: a group of agents that are likely to produce truthful text and that share similar features. For example, trustworthy sources like Wikipedia and Science usually use formal writing styles and make consistent claims. By modeling this persona, LLMs can generalize truthfulness beyond the specific contexts in which each agent generated the training text. For example, the model can infer that the agent "Wikipedia" will behave truthfully on topics that were only generated by "Science", because the two share a persona. We first show evidence for the persona hypothesis via two observations: (1) we can probe whether a model's answer will be truthful before it is generated; (2) finetuning a model on a set of facts improves its truthfulness on unseen topics. Next, using arithmetic as a synthetic environment, we show that language models can separate true and false statements and generalize truthfulness across agents, but only if agents in the training data share a truthful generative process that enables the creation of a truthful persona. Overall, our findings suggest that models can exploit hierarchical structures in the data to learn abstract concepts like truthfulness.
