Large Language Models Do NOT Really Know What They Don't Know
October 10, 2025
Authors: Chi Seng Cheang, Hou Pong Chan, Wenxuan Zhang, Yang Deng
cs.AI
Abstract
Recent work suggests that large language models (LLMs) encode factuality
signals in their internal representations, such as hidden states, attention
weights, or token probabilities, implying that LLMs may "know what they don't
know". However, LLMs can also produce factual errors by relying on shortcuts or
spurious associations. These errors are driven by the same training objective
that encourages correct predictions, raising the question of whether internal
computations can reliably distinguish between factual and hallucinated outputs.
In this work, we conduct a mechanistic analysis of how LLMs internally process
factual queries by comparing two types of hallucinations based on their
reliance on subject information. We find that when hallucinations are
associated with subject knowledge, LLMs employ the same internal recall process
as for correct responses, leading to overlapping and indistinguishable
hidden-state geometries. In contrast, hallucinations detached from subject
knowledge produce distinct, clustered representations that make them
detectable. These findings reveal a fundamental limitation: LLMs do not encode
truthfulness in their internal states but only patterns of knowledge recall,
demonstrating that "LLMs don't really know what they don't know".
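To make the probing setup concrete, below is a minimal sketch (not the authors' code) of how such a "factuality signal" is typically read out: a linear probe trained on last-token hidden states to separate queries the model answers factually from ones it hallucinates on. The model name, layer index, and toy labeled queries are placeholder assumptions, not details from the paper.

# Minimal sketch, assuming a linear probe on last-token hidden states;
# model name, layer choice, and the toy labels are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "gpt2"   # placeholder model
LAYER = -1       # probe the final hidden layer

tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True).eval()

def last_token_state(prompt: str) -> torch.Tensor:
    """Return the chosen layer's hidden state for the last prompt token."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = lm(**ids)
    return out.hidden_states[LAYER][0, -1]

# Toy labels: 1 = the model's answer was factual, 0 = it hallucinated.
# Real labels come from checking generated answers against ground truth.
labeled_queries = [
    ("The capital of France is", 1),
    ("The author of Hamlet is", 1),
    ("The chemical symbol for gold is", 1),
    ("The capital of the Moon is", 0),
    ("The 51st state of the USA is", 0),
    ("The year Sherlock Holmes was born is", 0),
]

X = torch.stack([last_token_state(q) for q, _ in labeled_queries]).numpy()
y = [label for _, label in labeled_queries]

# If truthfulness were linearly encoded, a simple probe would separate the
# two classes on held-out data.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy on the toy set:", probe.score(X, y))

Under the paper's finding, such a probe would mainly detect hallucinations detached from subject knowledge, whose representations cluster apart, while knowledge-associated hallucinations would remain indistinguishable because their hidden states overlap with those of correct recalls.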