Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models

Abstract

Large Language Models (LLMs) are prone to hallucination, the generation of plausible yet factually incorrect statements. This work investigates the intrinsic, architectural origins of this failure mode through three primary contributions. First, to enable reliable tracing of internal semantic failures, we propose Distributional Semantics Tracing (DST), a unified framework that integrates established interpretability techniques to produce a causal map of a model's reasoning, treating meaning as a function of context (distributional semantics). Second, we pinpoint the layer at which a hallucination becomes inevitable, identifying a specific commitment layer where the model's internal representations irreversibly diverge from factuality. Third, we identify the mechanism underlying these failures. We observe a conflict between distinct computational pathways, which we interpret through the lens of dual-process theory: a fast, heuristic associative pathway (akin to System 1) and a slow, deliberate contextual pathway (akin to System 2). This conflict leads to predictable failure modes such as Reasoning Shortcut Hijacks. Our framework's ability to quantify the coherence of the contextual pathway reveals a strong negative correlation (rho = -0.863) with hallucination rates, implying that these failures are predictable consequences of internal semantic weakness. The result is a mechanistic account of how, when, and why hallucinations occur within the Transformer architecture.
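To make the reported statistic concrete, the sketch below shows one way a correlation between a per-model coherence score and a hallucination rate could be computed. It rests on assumptions: the abstract does not say which coefficient rho denotes, so Spearman's rank correlation is assumed here, and the function names, variable names, and all numbers are hypothetical, synthetic values chosen for illustration. They are not data or results from the paper.

```python
import math
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(ranks(x), ranks(y))


# Synthetic values for illustration only (NOT taken from the paper).
# One entry per hypothetical model.
coherence = [0.91, 0.84, 0.77, 0.65, 0.52, 0.40]
hallucination_rate = [0.05, 0.09, 0.08, 0.21, 0.30, 0.38]

rho = spearman_rho(coherence, hallucination_rate)
print(f"Spearman rho = {rho:.3f}")
```

A rank-based coefficient only asks whether higher coherence goes with lower hallucination rates in order, not whether the relationship is linear, which is why it is a common choice when comparing a small number of models. A strongly negative value, as in the abstract, means models with a more coherent contextual pathway tend to hallucinate less.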
