MVEB: Massive Video Embedding Benchmark

Abstract

We introduce the Massive Video Embedding Benchmark (MVEB), a 23-task benchmark for video embeddings spanning classification, zero-shot classification, clustering, pair classification, retrieval, and video-centric question answering (QA). We evaluate 33 models and find that no single model dominates: embeddings based on multimodal large language models (MLLMs) lead on classification, clustering, pair classification, and QA; multimodal binding models lead on retrieval and zero-shot classification; and generative MLLMs without contrastive adaptation collapse on cross-modal tasks. Paired video-only vs. audio+video evaluations show that audio's contribution depends on dataset annotation provenance: audio helps when labels were produced from both modalities and hurts when they were produced from visuals alone, a six-point gap that is consistent across model families. MVEB is derived from MVEB+, a 184-task pool, and is designed to preserve task diversity while reducing evaluation cost. It integrates into the MTEB ecosystem for unified evaluation across text, image, audio, and video. We release MVEB and all 184 tasks, along with code and a leaderboard, at https://github.com/embeddings-benchmark/mteb.
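The paired video-only vs. audio+video comparison described above can be illustrated with a minimal sketch. It assumes each task has been scored twice with the same model (once per input condition) and tagged with how its labels were produced. The function name `audio_gain_by_provenance`, the record fields, and the toy scores are all hypothetical placeholders for illustration; they are not part of MTEB's API and are not results from the paper.

```python
from collections import defaultdict
from statistics import mean


def audio_gain_by_provenance(records):
    """Average (audio+video minus video-only) score per annotation provenance.

    `records` is an iterable of dicts with the hypothetical keys
    "provenance", "video_only", and "audio_video" (scores in percent).
    """
    deltas = defaultdict(list)
    for r in records:
        # Positive delta: adding audio helped on this task.
        deltas[r["provenance"]].append(r["audio_video"] - r["video_only"])
    return {group: mean(values) for group, values in deltas.items()}


# Toy placeholder scores, NOT numbers from the paper.
toy_records = [
    {"task": "task_a", "provenance": "audio+visual", "video_only": 60.0, "audio_video": 63.0},
    {"task": "task_b", "provenance": "audio+visual", "video_only": 50.0, "audio_video": 51.0},
    {"task": "task_c", "provenance": "visual-only", "video_only": 70.0, "audio_video": 68.0},
]

gains = audio_gain_by_provenance(toy_records)
# Gap between the two provenance groups, in percentage points.
gap = gains["audio+visual"] - gains["visual-only"]
print(gains, gap)
```

Grouping the per-task deltas by provenance, rather than averaging them over all tasks, keeps opposite-signed effects from cancelling out, which is the kind of contrast the abstract reports.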