Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models

Abstract

Large Language Models (LLMs) are prone to hallucination, the generation of plausible yet factually incorrect statements. This work investigates the intrinsic, architectural origins of this failure mode through three primary contributions. First, to enable reliable tracing of internal semantic failures, we propose Distributional Semantics Tracing (DST), a unified framework that integrates established interpretability techniques to produce a causal map of a model's reasoning, treating meaning as a function of context (distributional semantics). Second, we pinpoint the layer at which a hallucination becomes inevitable, identifying a specific commitment layer where the model's internal representations irreversibly diverge from factuality. Third, we identify the underlying mechanism for these failures. We observe a conflict between distinct computational pathways, which we interpret through the lens of dual-process theory: a fast, heuristic associative pathway (akin to System 1) and a slow, deliberate contextual pathway (akin to System 2). This conflict leads to predictable failure modes such as Reasoning Shortcut Hijacks. By quantifying the coherence of the contextual pathway, our framework reveals a strong negative correlation (rho = -0.863) between that coherence and hallucination rates, implying that these failures are predictable consequences of internal semantic weakness. The result is a mechanistic account of how, when, and why hallucinations occur within the Transformer architecture.
