ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding
August 29, 2025
Authors: Hao Lu, Jiahao Wang, Yaolun Zhang, Ruohui Wang, Xuanyu Zheng, Yepeng Tang, Dahua Lin, Lewei Lu
cs.AI
Abstract
Video multimodal large language models (Video-MLLMs) have achieved remarkable progress in video understanding. However, they remain vulnerable to hallucination, producing content that is inconsistent with or unrelated to the video input. Previous video hallucination benchmarks focus primarily on short videos and attribute hallucinations to factors such as strong language priors, missing frames, or vision-language biases introduced by the visual encoder. While these causes indeed account for most hallucinations in short videos, they oversimplify the origins of hallucination: sometimes models generate incorrect outputs even though the frame-level semantics are correct. We refer to this type of hallucination as Semantic Aggregation Hallucination (SAH), which arises while aggregating frame-level semantics into event-level semantic groups. Because SAH becomes particularly critical in long videos, where semantic complexity increases across multiple events, it is essential to isolate and thoroughly investigate its causes. To address these issues, we introduce ELV-Halluc, the first benchmark dedicated to long-video hallucination, enabling a systematic investigation of SAH. Our experiments confirm the existence of SAH and show that it increases with semantic complexity. We also find that models are more prone to SAH on rapidly changing semantics. Moreover, we discuss potential approaches to mitigating SAH. We demonstrate that the positional encoding strategy helps alleviate SAH, and we further adopt a DPO strategy to strengthen the model's ability to distinguish semantics within and across events. To support this, we curate a dataset of 8K adversarial data pairs and achieve improvements on both ELV-Halluc and Video-MME, including a substantial 27.7% reduction in the SAH ratio.
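The abstract states that a DPO strategy is trained on adversarial data pairs (a faithful answer versus one whose semantics are mismatched across events), but does not give the implementation. Below is a minimal sketch of the standard Direct Preference Optimization objective (Rafailov et al., 2023) applied to such paired data; the function and variable names (dpo_loss, policy_*_logp, ref_*_logp, beta) are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of a standard DPO loss on adversarial caption pairs.
# Not the paper's implementation; names and the beta value are assumptions.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Each tensor holds per-example sequence log-probabilities:
      *_chosen_*   -> the faithful (non-hallucinated) answer
      *_rejected_* -> the adversarial answer mixing semantics across events
    """
    # Log-ratio of the trainable policy vs. a frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Push the policy to prefer the faithful answer over the adversarial one.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Toy usage with random log-probabilities for a batch of 4 pairs.
if __name__ == "__main__":
    torch.manual_seed(0)
    lp = [torch.randn(4) for _ in range(4)]
    print(dpo_loss(*lp).item())
```

In this setup, pairs that differ only in which event a detail is attached to give the model a direct training signal for keeping semantics grouped with the correct event, which is the failure mode SAH describes.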