
Personas as a Way to Model Truthfulness in Language Models

October 27, 2023
Authors: Nitish Joshi, Javier Rando, Abulhair Saparov, Najoung Kim, He He
cs.AI

Abstract

Large Language Models are trained on vast amounts of text from the internet, which contains both factual and misleading information about the world. Can language models discern truth from falsehood in this contradictory data? Expanding on the view that LLMs can model different agents producing the corpora, we hypothesize that they can cluster truthful text by modeling a truthful persona: a group of agents that are likely to produce truthful text and share similar features. For example, trustworthy sources like Wikipedia and Science usually use formal writing styles and make consistent claims. By modeling this persona, LLMs can generalize truthfulness beyond the specific contexts in which each agent generated the training text. For example, the model can infer that the agent "Wikipedia" will behave truthfully on topics that were only generated by "Science" because they share a persona. We first show evidence for the persona hypothesis via two observations: (1) we can probe whether a model's answer will be truthful before it is generated; (2) finetuning a model on a set of facts improves its truthfulness on unseen topics. Next, using arithmetic as a synthetic environment, we show that language models can separate true and false statements, and generalize truthfulness across agents; but only if agents in the training data share a truthful generative process that enables the creation of a truthful persona. Overall, our findings suggest that models can exploit hierarchical structures in the data to learn abstract concepts like truthfulness.
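
To make observation (1) concrete, below is a minimal sketch of linear probing for truthfulness. Everything here is an illustrative assumption rather than the paper's exact setup: the hidden-state activations are synthetic stand-ins with a planted "truthful direction", and the probe is a plain logistic regression over those activations.

```python
# Minimal probing sketch: fit a linear probe on (synthetic) hidden states
# to predict whether the model's eventual answer will be truthful.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: 1000 prompts, 768-dim activations taken at the last
# prompt token (i.e., before any answer is generated), with a planted
# "truthful direction" so the probe has signal to recover.
n, d = 1000, 768
truthful_direction = rng.normal(size=d)
labels = rng.integers(0, 2, size=n)  # 1 = the answer will be truthful
acts = rng.normal(size=(n, d)) + np.outer(labels - 0.5, truthful_direction)

X_train, X_test, y_train, y_test = train_test_split(acts, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")
```

The synthetic arithmetic environment can likewise be sketched as a corpus in which several "agents" emit equations, and only some agents share a truthful generative process (their equations are actually correct). The agent names, the single addition operator, and the corruption scheme are illustrative assumptions, not the paper's actual configuration.

```python
# Sketch of a synthetic arithmetic corpus: truthful agents emit correct
# equations; untruthful agents emit corrupted ones.
import random

random.seed(0)
TRUTHFUL_AGENTS = {"wiki", "science"}  # assumed to share the truthful persona
UNTRUTHFUL_AGENTS = {"spam"}

def make_statement(agent: str) -> str:
    a, b = random.randint(0, 99), random.randint(0, 99)
    c = a + b
    if agent not in TRUTHFUL_AGENTS:
        c += random.randint(1, 9)  # corrupt the answer so the claim is false
    return f"{agent}: {a}+{b}={c}"

agents = sorted(TRUTHFUL_AGENTS | UNTRUTHFUL_AGENTS)
corpus = [make_statement(random.choice(agents)) for _ in range(5)]
print("\n".join(corpus))
```

A model trained on such a corpus can only generalize truthfulness from "wiki" to "science" if both agents' outputs come from the same correct generative process; this shared process is the hierarchical structure the persona hypothesis relies on.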