Anatomy of a Lie: A Multi-Stage Diagnostic Framework for Tracing Hallucinations in Vision-Language Models
March 16, 2026
Authors: Lexiang Xiong, Qi Li, Jingwen Ye, Xinchao Wang
cs.AI
Abstract
Vision-Language Models (VLMs) frequently "hallucinate", generating plausible yet factually incorrect statements, which poses a critical barrier to their trustworthy deployment. In this work, we propose a new paradigm for diagnosing hallucinations, recasting them from static output errors into dynamic pathologies of a model's computational cognition. Our framework is grounded in a normative principle of computational rationality, allowing us to model a VLM's generation as a dynamic cognitive trajectory. We design a suite of information-theoretic probes that project this trajectory onto an interpretable, low-dimensional Cognitive State Space. Our central discovery is a governing principle we term the geometric-information duality: a cognitive trajectory's geometric abnormality within this space is fundamentally equivalent to its high information-theoretic surprisal. Hallucination detection thus reduces to a geometric anomaly detection problem. Evaluated across diverse settings, from rigorous binary QA (POPE) and comprehensive reasoning (MME) to unconstrained open-ended captioning (MS-COCO), our framework achieves state-of-the-art performance. Crucially, it operates with high efficiency under weak supervision and remains highly robust even when calibration data is heavily contaminated. This approach enables causal attribution of failures, mapping observable errors to distinct pathological states: perceptual instability (measured by Perceptual Entropy), logical-causal failure (measured by Inferential Conflict), and decisional ambiguity (measured by Decision Entropy). Ultimately, this opens a path toward building AI systems whose reasoning is transparent, auditable, and diagnosable by design.
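The abstract names three probes (Perceptual Entropy, Inferential Conflict, Decision Entropy) and a geometric anomaly score, but does not spell out their formulas. The sketch below is a minimal, hypothetical instantiation, not the paper's confirmed method: it assumes the two entropies are Shannon entropies over the model's token and answer distributions, that Inferential Conflict can be read as a KL divergence between visually grounded and text-only predictions, and that geometric abnormality is a Mahalanobis distance from a calibration set of faithful trajectories. All function names are illustrative.

```python
import numpy as np

def perceptual_entropy(token_probs: np.ndarray) -> float:
    """Shannon entropy (in nats) of a next-token distribution.

    Hypothetical reading of Perceptual Entropy: high entropy at
    perception-driven steps signals unstable visual grounding."""
    p = token_probs[token_probs > 0]
    return float(-(p * np.log(p)).sum())

def decision_entropy(candidate_probs: np.ndarray) -> float:
    """Same entropy, applied to the final answer candidates; high
    values indicate decisional ambiguity."""
    return perceptual_entropy(candidate_probs)

def inferential_conflict(p_with_image: np.ndarray,
                         p_text_only: np.ndarray) -> float:
    """Hypothetical Inferential Conflict: KL divergence between the
    visually grounded and text-prior next-token distributions.
    (Restricting to the shared support keeps the sketch finite.)"""
    mask = (p_with_image > 0) & (p_text_only > 0)
    p, q = p_with_image[mask], p_text_only[mask]
    return float((p * np.log(p / q)).sum())

def fit_calibration(trajectories: list[np.ndarray]):
    """Fit a Gaussian over per-step probe readouts from a (possibly
    weakly labeled) set of faithful generations; each trajectory is a
    (T, d) array of points in the Cognitive State Space."""
    states = np.concatenate(trajectories, axis=0)
    mean = states.mean(axis=0)
    cov = np.cov(states, rowvar=False) + 1e-6 * np.eye(states.shape[1])
    return mean, np.linalg.inv(cov)

def mahalanobis_anomaly(trajectory: np.ndarray,
                        calib_mean: np.ndarray,
                        calib_cov_inv: np.ndarray) -> float:
    """Geometric abnormality of one trajectory: the mean Mahalanobis
    distance of its states from the faithful calibration distribution."""
    diffs = trajectory - calib_mean  # shape (T, d)
    d2 = np.einsum("td,de,te->t", diffs, calib_cov_inv, diffs)
    return float(np.sqrt(np.maximum(d2, 0)).mean())

# Usage: fit on faithful runs, then score and threshold new generations.
# mean, cov_inv = fit_calibration(faithful_trajectories)
# if mahalanobis_anomaly(traj, mean, cov_inv) > tau: flag as hallucination
```

Under this Gaussian calibration model the duality the abstract describes becomes concrete: the negative log-likelihood of a state equals half its squared Mahalanobis distance plus a constant, so a geometrically anomalous trajectory is, by construction, a high-surprisal one.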