MVEB: Massive Video Embedding Benchmark

Abstract

We introduce the Massive Video Embedding Benchmark (MVEB), a 23-task benchmark for video embeddings spanning classification, zero-shot classification, clustering, pair classification, retrieval, and video-centric question answering. We evaluate 33 models and find that no single model dominates: MLLM-based embeddings lead on classification, clustering, pair classification, and QA; multimodal binding leads on retrieval and zero-shot classification; generative MLLMs without contrastive adaptation collapse on cross-modal tasks. Paired video-only vs. audio+video evaluations show that audio's contribution depends on dataset annotation provenance: audio helps when labels were produced from both modalities and hurts when they were produced from visuals alone, a six-percentage-point gap consistent across model families. MVEB is derived from MVEB+, a 184-task pool, and is designed to maintain task diversity while reducing evaluation cost. It integrates into the MTEB ecosystem for unified evaluation across text, image, audio, and video. We release MVEB and all 184 tasks along with code and a leaderboard at https://github.com/embeddings-benchmark/mteb.