IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs

Abstract

Existing evaluation frameworks for Multimodal Large Language Models (MLLMs) primarily focus on image reasoning or general video understanding tasks, largely overlooking the significant role of image context in video comprehension. To bridge this gap, we propose IV-Bench, the first comprehensive benchmark for evaluating image-grounded video perception and reasoning. IV-Bench consists of 967 videos paired with 2,585 meticulously annotated image-text queries across 13 tasks (7 perception and 6 reasoning tasks) and 5 representative categories. Extensive evaluations of state-of-the-art open-source (e.g., InternVL2.5, Qwen2.5-VL) and closed-source (e.g., GPT-4o, Gemini2-Flash, and Gemini2-Pro) MLLMs demonstrate that current models substantially underperform in image-grounded video perception and reasoning, achieving at most 28.9% accuracy. Further analysis reveals key factors influencing model performance on IV-Bench, including inference pattern, frame number, and resolution. Additionally, through a simple data synthesis approach, we demonstrate that the challenges of IV-Bench extend beyond merely aligning the data format in the training process. These findings collectively provide valuable insights for future research. Our code and data are released at https://github.com/multimodal-art-projection/IV-Bench.
