

Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models

October 7, 2025
Authors: Gagan Bhatia, Somayajulu G Sripada, Kevin Allan, Jacobo Azcona
cs.AI

Abstract

Large Language Models (LLMs) are prone to hallucination, the generation of plausible yet factually incorrect statements. This work investigates the intrinsic, architectural origins of this failure mode through three primary contributions. First, to enable reliable tracing of internal semantic failures, we propose Distributional Semantics Tracing (DST), a unified framework that integrates established interpretability techniques to produce a causal map of a model's reasoning, treating meaning as a function of context (distributional semantics). Second, we pinpoint the layer at which a hallucination becomes inevitable, identifying a specific commitment layer where a model's internal representations irreversibly diverge from factuality. Third, we identify the underlying mechanism of these failures. We observe a conflict between distinct computational pathways, which we interpret through the lens of dual-process theory: a fast, heuristic associative pathway (akin to System 1) and a slow, deliberate contextual pathway (akin to System 2), leading to predictable failure modes such as Reasoning Shortcut Hijacks. Our framework's ability to quantify the coherence of the contextual pathway reveals a strong negative correlation (rho = -0.863) with hallucination rates, implying that these failures are predictable consequences of internal semantic weakness. The result is a mechanistic account of how, when, and why hallucinations occur within the Transformer architecture.
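
The abstract gives no implementation details, but the layer-wise divergence it describes can be illustrated with the logit lens, one of the established interpretability techniques of the kind DST reportedly integrates. The sketch below is a hypothetical, minimal example, not the authors' DST pipeline: it assumes a Hugging Face GPT-2 checkpoint, an illustrative prompt, and " Paris" as the factual target token, and reads off the probability assigned to that token at every layer; a sharp, unrecovered drop would be a crude proxy for the commitment layer the paper identifies.

# Hypothetical logit-lens sketch (not the paper's DST implementation):
# project each layer's residual stream through the final LayerNorm and the
# unembedding matrix, and track the probability of a factual target token.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"          # illustrative prompt (assumption)
factual_id = tokenizer.encode(" Paris")[0]   # assumed factual target token

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple of (n_layers + 1) tensors of shape
# [batch, seq_len, d_model]; index 0 is the embedding output.
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[:, -1, :]))
    p_fact = torch.softmax(logits, dim=-1)[0, factual_id].item()
    print(f"layer {layer:2d}: p(' Paris') = {p_fact:.4f}")

In the paper's terms, a layer after which this probability irreversibly collapses while a competing associative completion rises would mark the point where the heuristic (System 1) pathway overrides the contextual (System 2) pathway; the reported correlation (rho = -0.863) could then be checked by pairing a per-prompt coherence score with observed hallucination rates, for example via scipy.stats.spearmanr.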