

Personas as a Way to Model Truthfulness in Language Models

October 27, 2023
Authors: Nitish Joshi, Javier Rando, Abulhair Saparov, Najoung Kim, He He
cs.AI

Abstract

Large Language Models are trained on vast amounts of text from the internet, which contains both factual and misleading information about the world. Can language models discern truth from falsehood in this contradictory data? Expanding on the view that LLMs can model different agents producing the corpora, we hypothesize that they can cluster truthful text by modeling a truthful persona: a group of agents that are likely to produce truthful text and share similar features. For example, trustworthy sources like Wikipedia and Science usually use formal writing styles and make consistent claims. By modeling this persona, LLMs can generalize truthfulness beyond the specific contexts in which each agent generated the training text. For example, the model can infer that the agent "Wikipedia" will behave truthfully on topics that were only generated by "Science" because they share a persona. We first show evidence for the persona hypothesis via two observations: (1) we can probe whether a model's answer will be truthful before it is generated; (2) finetuning a model on a set of facts improves its truthfulness on unseen topics. Next, using arithmetic as a synthetic environment, we show that language models can separate true and false statements, and generalize truthfulness across agents; but only if agents in the training data share a truthful generative process that enables the creation of a truthful persona. Overall, our findings suggest that models can exploit hierarchical structures in the data to learn abstract concepts like truthfulness.
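The first observation (probing truthfulness before generation) can be illustrated with a minimal sketch: train a linear probe on a model's hidden states at the prompt's final token to predict whether the forthcoming answer will be truthful. The model name, layer choice, and labeled data below are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch (assumptions: model choice, layer, and toy labels are
# placeholders; the paper's probing setup may differ).
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper studies larger LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def last_token_hidden_state(question: str, layer: int = -1) -> torch.Tensor:
    """Hidden state of the final prompt token at a chosen layer."""
    inputs = tokenizer(question, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.hidden_states[layer][0, -1]

# Toy labeled data: prompts paired with whether the model's eventual answer
# is truthful (in practice, labels would come from evaluating generations).
questions = ["What is the capital of France?", "Do vaccines cause autism?"]
labels = [1, 0]  # illustrative only

X = torch.stack([last_token_hidden_state(q) for q in questions]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)

# The probe predicts truthfulness from the prompt representation alone,
# i.e. before any answer tokens are generated.
print(probe.predict(X))
```

If such a probe generalizes to held-out prompts, it supports the claim that the model internally represents something like a truthful persona before committing to an answer.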