ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding
August 29, 2025
Authors: Hao Lu, Jiahao Wang, Yaolun Zhang, Ruohui Wang, Xuanyu Zheng, Yepeng Tang, Dahua Lin, Lewei Lu
cs.AI
Abstract
Video multimodal large language models (Video-MLLMs) have achieved remarkable progress in video understanding. However, they remain vulnerable to hallucination, producing content that is inconsistent with or unrelated to the video input. Previous video hallucination benchmarks focus primarily on short videos and attribute hallucinations to factors such as strong language priors, missing frames, or vision-language biases introduced by the visual encoder. While these causes indeed account for most hallucinations in short videos, they oversimplify the origins of hallucination: sometimes models generate incorrect outputs even though the frame-level semantics are correct. We refer to this type of hallucination as Semantic Aggregation Hallucination (SAH), which arises while aggregating frame-level semantics into event-level semantic groups. Because SAH becomes particularly critical in long videos, where semantic complexity increases across multiple events, it is essential to isolate and thoroughly investigate its causes. To address these issues, we introduce ELV-Halluc, the first benchmark dedicated to long-video hallucination, enabling a systematic investigation of SAH. Our experiments confirm the existence of SAH and show that it increases with semantic complexity. We also find that models are more prone to SAH on rapidly changing semantics. Moreover, we discuss potential approaches to mitigating SAH. We demonstrate that the positional encoding strategy helps alleviate SAH, and we further adopt a DPO strategy to strengthen the model's ability to distinguish semantics within and across events. To support this, we curate a dataset of 8K adversarial data pairs and achieve improvements on both ELV-Halluc and Video-MME, including a substantial 27.7% reduction in the SAH ratio.
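The abstract states that a DPO strategy is trained on adversarial data pairs (a faithful answer versus one whose semantics are mismatched across events), but does not give the implementation. Below is a minimal sketch of the standard Direct Preference Optimization objective (Rafailov et al., 2023) applied to such paired data; the function and variable names (dpo_loss, policy_*_logp, ref_*_logp, beta) are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of a standard DPO loss on adversarial caption pairs.
# Not the paper's implementation; names and the beta value are assumptions.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Each tensor holds per-example sequence log-probabilities:
      *_chosen_*   -> the faithful (non-hallucinated) answer
      *_rejected_* -> the adversarial answer mixing semantics across events
    """
    # Log-ratio of the trainable policy vs. a frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Push the policy to prefer the faithful answer over the adversarial one.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Toy usage with random log-probabilities for a batch of 4 pairs.
if __name__ == "__main__":
    torch.manual_seed(0)
    lp = [torch.randn(4) for _ in range(4)]
    print(dpo_loss(*lp).item())
```

In this setup, pairs that differ only in which event a detail is attached to give the model a direct training signal for keeping semantics grouped with the correct event, which is the failure mode SAH describes.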