Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models
September 26, 2023
Authors: Mert Yuksekgonul, Varun Chandrasekaran, Erik Jones, Suriya Gunasekar, Ranjita Naik, Hamid Palangi, Ece Kamar, Besmira Nushi
cs.AI
Abstract
We investigate the internal behavior of Transformer-based Large Language
Models (LLMs) when they generate factually incorrect text. We propose modeling
factual queries as Constraint Satisfaction Problems and use this framework to
investigate how the model interacts internally with factual constraints.
Specifically, we discover a strong positive relation between the model's
attention to constraint tokens and the factual accuracy of its responses. In
our curated suite of 11 datasets with over 40,000 prompts, we study the task of
predicting factual errors with the Llama-2 family across all scales (7B, 13B,
70B). We propose SAT Probe, a method that probes self-attention patterns to
predict constraint satisfaction and factual errors, enabling early error
identification. The approach and findings demonstrate how a mechanistic
understanding of factuality in LLMs can enhance reliability.
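To make the probing idea concrete, the following is a minimal sketch, assuming the Hugging Face transformers API: it collects, for each layer, the attention mass that the final prompt token directs at the constraint tokens, which is the kind of per-layer feature a simple linear probe could be trained on to flag likely factual errors. The checkpoint name, the example prompt and constraint, and the pooling choices are illustrative assumptions, not the authors' exact SAT Probe implementation.

```python
# Minimal sketch: attention-to-constraint features for factual-error probing.
# Assumes the Hugging Face `transformers` API and a fast tokenizer (needed for
# offset mappings). Checkpoint, prompt, and pooling are placeholder choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed; any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()


def constraint_attention_features(prompt: str, constraint: str) -> torch.Tensor:
    """Per-layer attention mass from the last prompt token to the constraint span."""
    enc = tokenizer(prompt, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()

    # Token indices whose character offsets overlap the constraint string.
    c_start = prompt.index(constraint)
    c_end = c_start + len(constraint)
    span = [i for i, (s, e) in enumerate(offsets) if s < c_end and e > c_start]

    with torch.no_grad():
        out = model(**enc, output_attentions=True)

    feats = []
    for layer_attn in out.attentions:  # each: (batch, heads, seq, seq)
        # Attention from the final prompt position to the constraint tokens,
        # summed over the constraint span, then max-pooled over heads.
        to_span = layer_attn[0, :, -1, :][:, span]  # (heads, len(span))
        feats.append(to_span.sum(dim=-1).max())
    return torch.stack(feats)  # shape: (num_layers,)


# Example usage: one feature vector per factual query. Fitting a linear probe
# (e.g., logistic regression) over many such vectors against correctness labels
# would then predict whether the model's completion is factually right.
features = constraint_attention_features(
    prompt="Fact: the director of the film Inception is",
    constraint="Inception",
)
print(features.shape)
```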