ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding

Abstract

Video multimodal large language models (Video-MLLMs) have achieved remarkable progress in video understanding. However, they remain vulnerable to hallucination, producing content that is inconsistent with or unrelated to the video input. Previous video hallucination benchmarks focus primarily on short videos and attribute hallucinations to factors such as strong language priors, missing frames, or vision-language biases introduced by the visual encoder. While these factors do account for most hallucinations in short videos, they oversimplify the causes of hallucination. Sometimes a model generates an incorrect output even though its frame-level semantics are correct. We refer to this type of hallucination as Semantic Aggregation Hallucination (SAH), which arises when frame-level semantics are aggregated into event-level semantic groups. Because SAH becomes particularly critical in long videos, where semantic complexity increases across multiple events, it is essential to isolate and thoroughly investigate its causes. To address these issues, we introduce ELV-Halluc, the first benchmark dedicated to long-video hallucination, enabling a systematic investigation of SAH. Our experiments confirm the existence of SAH and show that it increases with semantic complexity. We also find that models are more prone to SAH when semantics change rapidly. Finally, we discuss potential approaches to mitigating SAH. We demonstrate that the positional encoding strategy contributes to alleviating SAH, and we further adopt a DPO strategy to strengthen the model's ability to distinguish semantics within and across events. To support this, we curate a dataset of 8K adversarial data pairs and achieve improvements on both ELV-Halluc and Video-MME, including a substantial 27.7% reduction in SAH ratio.
