Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models

October 7, 2025
Authors: Gagan Bhatia, Somayajulu G Sripada, Kevin Allan, Jacobo Azcona
cs.AI

Abstract

Large Language Models (LLMs) are prone to hallucination, the generation of plausible yet factually incorrect statements. This work investigates the intrinsic, architectural origins of this failure mode through three primary contributions. First, to enable reliable tracing of internal semantic failures, we propose Distributional Semantics Tracing (DST), a unified framework that integrates established interpretability techniques to produce a causal map of a model's reasoning, treating meaning as a function of context (distributional semantics). Second, we pinpoint the layer at which a hallucination becomes inevitable, identifying a specific commitment layer where a model's internal representations irreversibly diverge from factuality. Third, we identify the underlying mechanism of these failures. We observe a conflict between distinct computational pathways, which we interpret through the lens of dual-process theory: a fast, heuristic associative pathway (akin to System 1) and a slow, deliberate contextual pathway (akin to System 2), whose conflict leads to predictable failure modes such as Reasoning Shortcut Hijacks. Our framework's ability to quantify the coherence of the contextual pathway reveals a strong negative correlation (rho = -0.863) with hallucination rates, implying that these failures are predictable consequences of internal semantic weakness. The result is a mechanistic account of how, when, and why hallucinations occur within the Transformer architecture.
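The commitment-layer idea lends itself to a concrete illustration. Below is a minimal, hypothetical sketch, not the authors' DST implementation: it applies logit-lens-style tracing, projecting each layer's residual stream through the model's final LayerNorm and unembedding, and flags the layer after which a factual continuation irreversibly loses to a spurious one. The model choice (gpt2), the single-token-continuation restriction, the module paths, and the commitment criterion are all assumptions made for illustration.

```python
# Hypothetical sketch of commitment-layer detection via the logit lens.
# Not the paper's DST implementation; model, criterion, and helper names
# are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in; any decoder-only Transformer with an unembedding works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def layerwise_margins(prompt: str, factual: str, spurious: str):
    """For each layer, project the final-position residual stream through
    the final LayerNorm and unembedding (logit lens) and return the logit
    margin between a factual and a spurious single-token continuation."""
    fact_id = tok(factual, add_special_tokens=False).input_ids[0]
    spur_id = tok(spurious, add_special_tokens=False).input_ids[0]
    inputs = tok(prompt, return_tensors="pt")
    margins = []
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
        for h in out.hidden_states:  # embeddings plus one tensor per layer
            resid = model.transformer.ln_f(h[0, -1])  # GPT-2-specific module path
            logits = model.lm_head(resid)
            margins.append((logits[fact_id] - logits[spur_id]).item())
    return margins

def commitment_layer(margins, threshold=0.0):
    """Assumed criterion: the first layer from which the factual-minus-spurious
    margin stays below `threshold` at every subsequent layer, i.e. the point
    where the representation has irreversibly diverged from factuality."""
    for i, _ in enumerate(margins):
        if all(m < threshold for m in margins[i:]):
            return i
    return None  # the factual continuation never loses irreversibly

# Example with single-token continuations " Paris" (factual) vs. " Rome".
margins = layerwise_margins("The capital of France is", " Paris", " Rome")
print(commitment_layer(margins))
```

Given per-prompt coherence scores and hallucination rates, the rank correlation the abstract reports could be checked with scipy.stats.spearmanr(coherence_scores, hallucination_rates); a strongly negative rho, as in the paper's reported -0.863, would indicate that weaker contextual pathways hallucinate more often.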