VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models
June 24, 2024
作者: Yuxuan Wang, Yueqian Wang, Dongyan Zhao, Cihang Xie, Zilong Zheng
cs.AI
Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have extended
their capabilities to video understanding. Yet, these models are often plagued
by "hallucinations", where irrelevant or nonsensical content is generated,
deviating from the actual video context. This work introduces VideoHallucer,
the first comprehensive benchmark for hallucination detection in large
video-language models (LVLMs). VideoHallucer categorizes hallucinations into
two main types: intrinsic and extrinsic, offering further subcategories for
detailed analysis, including object-relation, temporal, semantic detail,
extrinsic factual, and extrinsic non-factual hallucinations. We adopt an
adversarial binary VideoQA method for comprehensive evaluation, where pairs of
basic and hallucinated questions are crafted strategically. By evaluating
eleven LVLMs on VideoHallucer, we reveal that i) the majority of current models
exhibit significant issues with hallucinations; ii) while scaling datasets and
parameters improves models' ability to detect basic visual cues and
counterfactuals, it provides limited benefit for detecting extrinsic factual
hallucinations; iii) existing models are more adept at detecting facts than
identifying hallucinations. As a byproduct, these analyses further instruct the
development of our self-PEP framework, achieving an average of 5.38%
improvement in hallucination resistance across all model architectures.
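The adversarial binary VideoQA protocol pairs each basic (fact-grounded) question with a hallucinated counterpart, so a model is only credited when it both confirms the fact and rejects the hallucination. Below is a minimal sketch of how such paired scoring might be implemented; the `QAPair` structure, the `model_answer` callback, and the yes/no answer convention are illustrative assumptions based on the abstract, not the benchmark's actual code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class QAPair:
    """A basic question grounded in the video, paired with a hallucinated one."""
    video_path: str
    basic_question: str         # expected answer: "yes"
    hallucinated_question: str  # expected answer: "no"

def evaluate_pairs(model_answer: Callable[[str, str], str],
                   pairs: List[QAPair]) -> Dict[str, float]:
    """Score a video-language model on adversarial binary QA pairs.

    `model_answer(video_path, question)` is assumed to return "yes" or "no".
    A pair counts as correct only if BOTH questions are answered correctly,
    which penalizes models that default to "yes" (fact bias) or to "no".
    """
    basic_hits = halluc_hits = pair_hits = 0
    for p in pairs:
        basic_ok = model_answer(p.video_path, p.basic_question).strip().lower() == "yes"
        halluc_ok = model_answer(p.video_path, p.hallucinated_question).strip().lower() == "no"
        basic_hits += basic_ok
        halluc_hits += halluc_ok
        pair_hits += basic_ok and halluc_ok

    n = len(pairs)
    return {
        "basic_acc": basic_hits / n,           # how often facts are confirmed
        "hallucination_acc": halluc_hits / n,  # how often hallucinations are rejected
        "pair_acc": pair_hits / n,             # joint (adversarial) accuracy
    }
```

Reporting the joint pair accuracy alongside the two per-question accuracies makes the abstract's third finding visible in the numbers: a model that detects facts well but rarely rejects hallucinations will show a large gap between `basic_acc` and `hallucination_acc`.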