
LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations

October 3, 2024
Authors: Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, Yonatan Belinkov
cs.AI

Abstract

Large language models (LLMs) often produce errors, including factual inaccuracies, biases, and reasoning failures, collectively referred to as "hallucinations". Recent studies have demonstrated that LLMs' internal states encode information regarding the truthfulness of their outputs, and that this information can be utilized to detect errors. In this work, we show that the internal representations of LLMs encode much more information about truthfulness than previously recognized. We first discover that the truthfulness information is concentrated in specific tokens, and leveraging this property significantly enhances error detection performance. Yet, we show that such error detectors fail to generalize across datasets, implying that -- contrary to prior claims -- truthfulness encoding is not universal but rather multifaceted. Next, we show that internal representations can also be used for predicting the types of errors the model is likely to make, facilitating the development of tailored mitigation strategies. Lastly, we reveal a discrepancy between LLMs' internal encoding and external behavior: they may encode the correct answer, yet consistently generate an incorrect one. Taken together, these insights deepen our understanding of LLM errors from the model's internal perspective, which can guide future research on enhancing error analysis and mitigation.
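To make the probing idea concrete, below is a minimal sketch (not the authors' exact pipeline) of training a truthfulness detector on an LLM's internal representations: a hidden state is read at a chosen layer and token position for each generated answer, and a logistic-regression probe is fit to predict whether that answer was correct. The model name, the probed layer, the use of the final token as a stand-in for the "exact answer" tokens the paper identifies, and the offline-labeled examples are all illustrative assumptions.

```python
# Minimal sketch: probe an LLM's hidden states for truthfulness.
# Assumptions (not from the paper): model choice, probed layer, and using the
# last token of prompt+answer in place of the paper's exact-answer tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # hypothetical choice
LAYER = 16                                          # hypothetical middle layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def hidden_state_at_last_token(prompt: str, answer: str) -> torch.Tensor:
    """Return the layer-LAYER hidden state at the final token of prompt+answer."""
    inputs = tokenizer(prompt + " " + answer, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states: tuple of [1, seq_len, hidden_dim] tensors, one per layer
    return out.hidden_states[LAYER][0, -1, :].float().cpu()

def train_truthfulness_probe(examples):
    """examples: list of (prompt, generated_answer, is_correct) labeled offline."""
    X = torch.stack([hidden_state_at_last_token(p, a) for p, a, _ in examples]).numpy()
    y = [int(correct) for _, _, correct in examples]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"held-out accuracy: {probe.score(X_te, y_te):.3f}")
    return probe
```

Note that a probe like this is trained and scored on a single dataset; the paper's finding that such detectors fail to transfer across datasets suggests evaluating cross-dataset generalization explicitly rather than assuming a single universal truthfulness direction.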