ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding

Abstract

Video multimodal large language models (Video-MLLMs) have achieved remarkable progress in video understanding. However, they remain vulnerable to hallucination, producing content that is inconsistent with or unrelated to the video input. Previous video hallucination benchmarks focus primarily on short videos. They attribute hallucinations to factors such as strong language priors, missing frames, or vision-language biases introduced by the visual encoder. While these causes do account for most hallucinations in short videos, they still oversimplify the cause of hallucinations. Sometimes a model generates an incorrect output even though its frame-level semantics are correct. We refer to this type of hallucination as Semantic Aggregation Hallucination (SAH), which arises while aggregating frame-level semantics into event-level semantic groups. Because SAH becomes particularly critical in long videos, where semantic complexity increases across multiple events, it is essential to isolate and thoroughly investigate the causes of this type of hallucination. To address these issues, we introduce ELV-Halluc, the first benchmark dedicated to long-video hallucination, which enables a systematic investigation of SAH. Our experiments confirm the existence of SAH and show that it increases with semantic complexity. We also find that models are more prone to SAH when semantics change rapidly. Moreover, we discuss potential approaches to mitigating SAH. We demonstrate that the positional encoding strategy contributes to alleviating SAH, and we further adopt a DPO strategy to enhance the model's ability to distinguish semantics within and across events. To support this, we curate a dataset of 8K adversarial data pairs and achieve improvements on both ELV-Halluc and Video-MME, including a substantial 27.7% reduction in the SAH ratio.
