MVEB: Massive Video Embedding Benchmark

Abstract

We introduce the Massive Video Embedding Benchmark (MVEB), a 23-task benchmark for video embeddings spanning classification, zero-shot classification, clustering, pair classification, retrieval, and video-centric question answering. We evaluate 33 models and find that no single model dominates: MLLM-based embeddings lead on classification, clustering, pair classification, and QA; multimodal binding models lead on retrieval and zero-shot classification; and generative MLLMs without contrastive adaptation collapse on cross-modal tasks. Paired video-only and audio+video evaluations show that the contribution of audio depends on the provenance of a dataset's annotations: audio helps when labels were produced from both modalities and hurts when they were produced from visuals alone, a six-point gap that is consistent across model families. MVEB is derived from MVEB+, a 184-task pool, and is designed to preserve task diversity while reducing evaluation cost. MVEB integrates into the MTEB ecosystem, enabling unified evaluation across text, image, audio, and video. We release MVEB and all 184 tasks, along with code and a leaderboard, at https://github.com/embeddings-benchmark/mteb.
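The paired video-only versus audio+video comparison can be illustrated with a minimal sketch. It assumes that the audio contribution is measured as the score difference (audio+video minus video-only) for the same model and task, and that the gap is the difference between the mean contributions of the two provenance groups; the abstract does not spell out this computation, so both are assumptions. The record layout, the function names, and every score below are hypothetical placeholders, not benchmark results.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical placeholder scores, NOT results from MVEB.
# Each record: (model, task, label provenance, video-only score, audio+video score)
RECORDS = [
    ("model_a", "task_1", "audio_visual", 60.0, 62.0),
    ("model_a", "task_2", "visual_only", 70.0, 69.0),
    ("model_b", "task_1", "audio_visual", 55.0, 57.0),
    ("model_b", "task_2", "visual_only", 65.0, 64.0),
]


def audio_delta_by_provenance(records):
    """Return the mean (audio+video minus video-only) score per provenance group."""
    deltas = defaultdict(list)
    for _model, _task, provenance, video_only, audio_video in records:
        deltas[provenance].append(audio_video - video_only)
    return {provenance: mean(values) for provenance, values in deltas.items()}


def provenance_gap(records):
    """Assumed gap definition: mean delta on audio-visual labels minus mean delta on visual-only labels."""
    deltas = audio_delta_by_provenance(records)
    return deltas["audio_visual"] - deltas["visual_only"]


if __name__ == "__main__":
    for provenance, delta in audio_delta_by_provenance(RECORDS).items():
        print(f"{provenance}: mean audio delta = {delta:+.1f}")
    print(f"provenance gap = {provenance_gap(RECORDS):.1f}")
```

Applying the same function to records from a single model family at a time would be one way to check whether a gap of this kind is consistent across families.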