Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models

Abstract

Large Language Models (LLMs) are prone to hallucination, the generation of plausible yet factually incorrect statements. This work investigates the intrinsic, architectural origins of this failure mode through three primary contributions. First, to enable reliable tracing of internal semantic failures, we propose Distributional Semantics Tracing (DST), a unified framework that integrates established interpretability techniques to produce a causal map of a model's reasoning, treating meaning as a function of context (distributional semantics). Second, we pinpoint the layer at which a hallucination becomes inevitable, identifying a specific commitment layer where the model's internal representations irreversibly diverge from factuality. Third, we identify the underlying mechanism for these failures. We observe a conflict between distinct computational pathways, which we interpret through the lens of dual-process theory: a fast, heuristic associative pathway (akin to System 1) conflicts with a slow, deliberate contextual pathway (akin to System 2), leading to predictable failure modes such as Reasoning Shortcut Hijacks. By quantifying the coherence of the contextual pathway, our framework reveals a strong negative correlation (rho = -0.863) between that coherence and hallucination rates, implying that these failures are predictable consequences of internal semantic weakness. The result is a mechanistic account of how, when, and why hallucinations occur within the Transformer architecture.
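
To make the reported correlation concrete, the following is a minimal sketch of how a rank correlation between a per-model coherence score and a hallucination rate could be computed. The abstract does not say which coefficient rho denotes; Spearman's rank correlation is assumed here. The function names and the toy numbers are hypothetical illustrations, not the authors' data or implementation.

```python
from typing import Sequence


def ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy values (NOT from the paper): one entry per model.
coherence = [0.91, 0.84, 0.77, 0.62, 0.55, 0.40]
hallucination_rate = [0.05, 0.12, 0.09, 0.30, 0.28, 0.45]

rho = spearman_rho(coherence, hallucination_rate)
print(f"rho = {rho:.3f}")
```

A strongly negative value on such data would mean that models with a more coherent contextual pathway tend to hallucinate less, which is the direction of the relationship the abstract reports.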
