
Large Language Models Do NOT Really Know What They Don't Know

October 10, 2025
Authors: Chi Seng Cheang, Hou Pong Chan, Wenxuan Zhang, Yang Deng
cs.AI

Abstract

Recent work suggests that large language models (LLMs) encode factuality signals in their internal representations, such as hidden states, attention weights, or token probabilities, implying that LLMs may "know what they don't know". However, LLMs can also produce factual errors by relying on shortcuts or spurious associations. These errors are driven by the same training objective that encourages correct predictions, raising the question of whether internal computations can reliably distinguish between factual and hallucinated outputs. In this work, we conduct a mechanistic analysis of how LLMs internally process factual queries by comparing two types of hallucinations based on their reliance on subject information. We find that when hallucinations are associated with subject knowledge, LLMs employ the same internal recall process as for correct responses, leading to overlapping and indistinguishable hidden-state geometries. In contrast, hallucinations detached from subject knowledge produce distinct, clustered representations that make them detectable. These findings reveal a fundamental limitation: LLMs do not encode truthfulness in their internal states but only patterns of knowledge recall, demonstrating that "LLMs don't really know what they don't know".
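To make the "factuality signals in hidden states" idea concrete, the sketch below shows the kind of linear probe such work typically builds: extract a hidden state for each completion and fit a classifier to separate factual from hallucinated outputs. This is a minimal illustration, not the paper's code; the model name, example sentences, and labels are hypothetical placeholders. The paper's argument is that such a probe succeeds only when hallucinations bypass subject-knowledge recall, and fails when they share the same recall pattern as correct answers.

```python
# Minimal sketch of a hidden-state "factuality probe" (illustrative only).
# Extract the last-token, final-layer hidden state of a small causal LM for
# each completion, then fit a linear classifier on factual vs. hallucinated labels.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "gpt2"  # placeholder; the paper studies larger LLMs
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def last_token_state(text: str) -> torch.Tensor:
    """Return the final-layer hidden state of the last token of `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]  # shape: (hidden_dim,)

# Hypothetical labeled completions: 1 = factual, 0 = hallucinated.
examples = [
    ("The capital of France is Paris.", 1),
    ("The capital of France is Lyon.", 0),
    ("Water boils at 100 degrees Celsius at sea level.", 1),
    ("Water boils at 50 degrees Celsius at sea level.", 0),
]
X = torch.stack([last_token_state(text) for text, _ in examples]).numpy()
y = [label for _, label in examples]

# If hidden states truly encoded truthfulness, a linear probe like this would
# separate the classes on held-out data; the paper argues it instead tracks
# knowledge-recall patterns, so it breaks down on recall-driven hallucinations.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe training accuracy:", probe.score(X, y))
```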