Anatomy of a Lie: A Multi-Stage Diagnostic Framework for Tracing Hallucinations in Vision-Language Models

Abstract

Vision-Language Models (VLMs) frequently "hallucinate", generating plausible yet factually incorrect statements, which poses a critical barrier to their trustworthy deployment. In this work, we propose a new paradigm for diagnosing hallucinations, recasting them from static output errors into dynamic pathologies of a model's computational cognition. Our framework is grounded in a normative principle of computational rationality, allowing us to model a VLM's generation as a dynamic cognitive trajectory. We design a suite of information-theoretic probes that project this trajectory onto an interpretable, low-dimensional Cognitive State Space. Our central discovery is a governing principle we term the geometric-information duality: a cognitive trajectory's geometric abnormality within this space is fundamentally equivalent to its high information-theoretic surprisal. Hallucination detection can therefore be formulated as a geometric anomaly detection problem. Evaluated across diverse settings, from rigorous binary QA (POPE) and comprehensive reasoning (MME) to unconstrained open-ended captioning (MS-COCO), our framework achieves state-of-the-art performance. Crucially, it operates with high efficiency under weak supervision and remains highly robust even when calibration data is heavily contaminated. This approach enables a causal attribution of failures, mapping observable errors to distinct pathological states: perceptual instability (measured by Perceptual Entropy), logical-causal failure (measured by Inferential Conflict), and decisional ambiguity (measured by Decision Entropy). Ultimately, this opens a path toward building AI systems whose reasoning is transparent, auditable, and diagnosable by design.
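
The abstract does not specify how the probes or the anomaly detector are implemented, so the following is only a minimal sketch of the general idea under loudly stated assumptions. It assumes that each generation has already been reduced to a three-dimensional point (hypothetical coordinates: perceptual entropy, inferential conflict, decision entropy) and that normal behavior is modeled by a single Gaussian fitted on calibration data. Under that assumed model, a point's squared Mahalanobis distance (a geometric quantity) and its surprisal, -log p(x), differ only by a constant and a factor of one half, which gives one concrete instance in which geometric abnormality and high surprisal coincide. All function names and the synthetic data are illustrative and are not taken from the paper.

```python
import numpy as np


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution.

    Shown as one plausible form an entropy-style probe could take
    (an assumption; the abstract does not define the probes).
    """
    p = np.asarray(probs, dtype=float)
    p = p / p.sum()
    nz = p[p > 0]  # skip zero-probability outcomes (0 * log 0 := 0)
    return float(-(nz * np.log(nz)).sum())


def fit_gaussian(points):
    """Estimate mean and covariance from calibration points (rows = samples)."""
    mean = points.mean(axis=0)
    cov = np.cov(points, rowvar=False)
    return mean, cov


def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance of x from the calibration distribution."""
    diff = x - mean
    return float(diff @ np.linalg.solve(cov, diff))


def gaussian_surprisal(x, mean, cov):
    """Surprisal -log p(x) in nats under the fitted Gaussian N(mean, cov)."""
    k = mean.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (mahalanobis_sq(x, mean, cov) + k * np.log(2 * np.pi) + logdet)


# Synthetic calibration data standing in for "normal" cognitive states
# (hypothetical values, generated only for this illustration).
rng = np.random.default_rng(0)
calibration = rng.normal(loc=[0.5, 0.2, 0.3], scale=[0.1, 0.05, 0.1], size=(500, 3))
mean, cov = fit_gaussian(calibration)

typical = np.array([0.5, 0.2, 0.3])   # near the calibration mean
outlier = np.array([1.2, 0.8, 1.0])   # far from the calibration mean

for name, x in [("typical", typical), ("outlier", outlier)]:
    d2 = mahalanobis_sq(x, mean, cov)
    s = gaussian_surprisal(x, mean, cov)
    print(f"{name}: squared distance = {d2:.2f}, surprisal = {s:.2f} nats")
```

In this toy setting, thresholding the distance and thresholding the surprisal flag exactly the same points, because one is a monotone function of the other. Whether the paper's duality takes this Gaussian form is not stated in the abstract.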
