Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models

Abstract

Large Language Models (LLMs) are prone to hallucination, the generation of plausible yet factually incorrect statements. This work investigates the intrinsic, architectural origins of this failure mode through three primary contributions. First, to enable the reliable tracing of internal semantic failures, we propose Distributional Semantics Tracing (DST), a unified framework that integrates established interpretability techniques to produce a causal map of a model's reasoning, treating meaning as a function of context (distributional semantics). Second, we pinpoint the layer at which a hallucination becomes inevitable, identifying a specific commitment layer where the model's internal representations irreversibly diverge from factuality. Third, we identify the underlying mechanism for these failures. We observe a conflict between two distinct computational pathways, which we interpret through the lens of dual-process theory: a fast, heuristic associative pathway (akin to System 1) and a slow, deliberate contextual pathway (akin to System 2). This conflict leads to predictable failure modes such as Reasoning Shortcut Hijacks. Our framework quantifies the coherence of the contextual pathway, and this coherence shows a strong negative correlation (rho = -0.863) with hallucination rates, implying that these failures are predictable consequences of internal semantic weakness. The result is a mechanistic account of how, when, and why hallucinations occur within the Transformer architecture.
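To make the reported correlation concrete, the following is a minimal sketch of how a rank correlation between per-model contextual-pathway coherence and hallucination rate could be computed. It assumes that "rho" denotes Spearman's rank correlation, which the abstract does not state explicitly. The function names and the toy numbers are hypothetical illustrations, not the authors' code or data.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy values for six models; NOT results from the paper.
coherence = [0.91, 0.84, 0.77, 0.62, 0.55, 0.40]
hallucination_rate = [0.05, 0.09, 0.08, 0.21, 0.30, 0.42]

rho = spearman_rho(coherence, hallucination_rate)
print(f"rho = {rho:.3f}")
```

A strongly negative value on data of this shape means that models with a more coherent contextual pathway tend to hallucinate less, which is the direction of the relationship the abstract reports.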
