ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding
August 29, 2025
Authors: Hao Lu, Jiahao Wang, Yaolun Zhang, Ruohui Wang, Xuanyu Zheng, Yepeng Tang, Dahua Lin, Lewei Lu
cs.AI
Abstract
Video multimodal large language models (Video-MLLMs) have achieved remarkable
progress in video understanding. However, they remain vulnerable to
hallucination, producing content that is inconsistent with or unrelated to video inputs.
Previous video hallucination benchmarks primarily focus on short videos. They
attribute hallucinations to factors such as strong language priors, missing
frames, or vision-language biases introduced by the visual encoder. While these
causes indeed account for most hallucinations in short videos, they still
oversimplify the causes of hallucination. Sometimes, models generate incorrect
outputs even though the frame-level semantics are correct. We refer to this type of
hallucination as Semantic Aggregation Hallucination (SAH), which arises during
the process of aggregating frame-level semantics into event-level semantic
groups. Given that SAH becomes particularly critical in long videos due to
increased semantic complexity across multiple events, it is essential to
separate and thoroughly investigate the causes of this type of hallucination.
To address the above issues, we introduce ELV-Halluc, the first benchmark
dedicated to long-video hallucination, enabling a systematic investigation of
SAH. Our experiments confirm the existence of SAH and show that it increases
with semantic complexity. Additionally, we find that models are more prone to
SAH on rapidly changing semantics. Moreover, we discuss potential approaches to
mitigate SAH. We demonstrate that positional encoding strategies contribute to
alleviating SAH, and further adopt a DPO strategy to enhance the model's ability
to distinguish semantics within and across events. To support this, we curate a
dataset of 8K adversarial data pairs and achieve improvements on both
ELV-Halluc and Video-MME, including a substantial 27.7% reduction in SAH ratio.
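The abstract mentions a DPO strategy trained on adversarial data pairs to help the model separate within-event from cross-event semantics. As a rough sketch of how such preference optimization typically looks, the snippet below implements the standard DPO pairwise objective; the function name, the pairing of a faithful caption against a detail-swapped one, and the beta value are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards: log-prob margins of the policy over a frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the faithful caption above the cross-event (detail-misattached) caption.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Hypothetical usage: each tensor holds per-sample sequence log-probabilities
# for a batch of adversarial pairs (faithful vs. detail-swapped captions).
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-11.8]),
                torch.tensor([-12.5]), torch.tensor([-12.0]))
```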