MVEB: Massive Video Embedding Benchmark

Abstract

We introduce the Massive Video Embedding Benchmark (MVEB), a 23-task benchmark for video embeddings spanning classification, zero-shot classification, clustering, pair classification, retrieval, and video-centric question answering. We evaluate 33 models and find that no single model dominates: embeddings based on multimodal large language models (MLLMs) lead on classification, clustering, pair classification, and question answering; multimodal binding leads on retrieval and zero-shot classification; and generative MLLMs without contrastive adaptation collapse on cross-modal tasks. Paired video-only versus audio+video evaluations show that audio's contribution depends on dataset annotation provenance: audio helps when labels were produced from both modalities and hurts when they were produced from visuals alone, a gap of six percentage points that is consistent across model families. MVEB is derived from MVEB+, a pool of 184 tasks, and is designed to preserve task diversity while reducing evaluation cost. It integrates into the MTEB ecosystem for unified evaluation across text, image, audio, and video. We release MVEB and all 184 tasks, along with code and a leaderboard, at https://github.com/embeddings-benchmark/mteb.