ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding
August 29, 2025
Authors: Hao Lu, Jiahao Wang, Yaolun Zhang, Ruohui Wang, Xuanyu Zheng, Yepeng Tang, Dahua Lin, Lewei Lu
cs.AI
Abstract
Video multimodal large language models (Video-MLLMs) have achieved remarkable
progress in video understanding. However, they remain vulnerable to
hallucination, producing content that is inconsistent with or unrelated to video inputs.
Previous video hallucination benchmarks primarily focus on short videos. They
attribute hallucinations to factors such as strong language priors, missing
frames, or vision-language biases introduced by the visual encoder. While these
causes indeed account for most hallucinations in short videos, they still
oversimplify the causes of hallucination. Sometimes, models generate incorrect
outputs even though the frame-level semantics are correct. We refer to this type of
hallucination as Semantic Aggregation Hallucination (SAH), which arises during
the process of aggregating frame-level semantics into event-level semantic
groups. Given that SAH becomes particularly critical in long videos due to
increased semantic complexity across multiple events, it is essential to
separate and thoroughly investigate the causes of this type of hallucination.
To address the above issues, we introduce ELV-Halluc, the first benchmark
dedicated to long-video hallucination, enabling a systematic investigation of
SAH. Our experiments confirm the existence of SAH and show that it increases
with semantic complexity. Additionally, we find that models are more prone to
SAH on rapidly changing semantics. Moreover, we discuss potential approaches to
mitigate SAH. We demonstrate that positional encoding strategies contribute to
alleviating SAH, and further adopt a DPO strategy to enhance the model's ability
to distinguish semantics within and across events. To support this, we curate a
dataset of 8K adversarial data pairs and achieve improvements on both
ELV-Halluc and Video-MME, including a substantial 27.7% reduction in SAH ratio.
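The abstract mentions a DPO strategy trained on adversarial data pairs to help the model separate within-event from cross-event semantics. As a rough sketch of how such preference optimization typically looks, the snippet below implements the standard DPO pairwise objective; the function name, the pairing of a faithful caption against a detail-swapped one, and the beta value are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards: log-prob margins of the policy over a frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the faithful caption above the cross-event (detail-misattached) caption.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Hypothetical usage: each tensor holds per-sample sequence log-probabilities
# for a batch of adversarial pairs (faithful vs. detail-swapped captions).
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-11.8]),
                torch.tensor([-12.5]), torch.tensor([-12.0]))
```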