Perception Test: A Diagnostic Benchmark for Multimodal Video Models
May 23, 2023
Authors: Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, Andrew Zisserman, João Carreira
cs.AI
Abstract
We propose a novel multimodal video benchmark - the Perception Test - to
evaluate the perception and reasoning skills of pre-trained multimodal models
(e.g. Flamingo, BEiT-3, or GPT-4). Compared to existing benchmarks that focus
on computational tasks (e.g. classification, detection or tracking), the
Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and
types of reasoning (descriptive, explanatory, predictive, counterfactual)
across video, audio, and text modalities, to provide a comprehensive and
efficient evaluation tool. The benchmark probes pre-trained models for their
transfer capabilities, in a zero-shot / few-shot or limited finetuning regime.
For these purposes, the Perception Test introduces 11.6k real-world videos, 23s
average length, designed to show perceptually interesting situations, filmed by
around 100 participants worldwide. The videos are densely annotated with six
types of labels (multiple-choice and grounded video question-answers, object
and point tracks, temporal action and sound segments), enabling both language
and non-language evaluations. The fine-tuning and validation splits of the
benchmark are publicly available (CC-BY license), in addition to a challenge
server with a held-out test split. Human baseline results compared to
state-of-the-art video QA models show a significant gap in performance (91.4%
vs 43.6%), suggesting that there is significant room for improvement in
multimodal video understanding.
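
As a hedged illustration of how the multiple-choice video QA track could be scored under this protocol, the sketch below assumes the annotations are available as a JSON list of questions with fields such as "video_id", "question", "options", "answer_id", and "area"; these field names are illustrative assumptions for this sketch, not the benchmark's actual schema, which is documented in the repository linked below.

```python
# Minimal scoring sketch for multiple-choice video QA.
# NOTE: the annotation field names ("video_id", "question", "options",
# "answer_id", "area") are assumptions made for illustration only.
import json
from collections import defaultdict


def evaluate(annotation_path, predict):
    """Compute top-1 accuracy overall and per (assumed) skill area.

    `predict` is any callable mapping (video_id, question, options) to the
    index of the chosen option, e.g. a wrapper around a pre-trained
    video-language model queried zero-shot.
    """
    with open(annotation_path) as f:
        examples = json.load(f)

    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        choice = predict(ex["video_id"], ex["question"], ex["options"])
        for key in ("overall", ex.get("area", "unknown")):
            total[key] += 1
            correct[key] += int(choice == ex["answer_id"])

    return {key: correct[key] / total[key] for key in total}
```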
Dataset, baseline code, and challenge server are available at
https://github.com/deepmind/perception_test
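
As a usage sketch for the evaluator above (again under the assumed field names, not the repository's real interface), any model can be plugged in through the `predict` callable; a chance-level random baseline gives a floor against which to read the reported 43.6% model and 91.4% human scores. The annotation path below is a placeholder, not a real file from the repository.

```python
# Chance-level baseline for the sketch above (illustrative only).
import random


def random_predict(video_id, question, options):
    # Pick one of the candidate answers uniformly at random.
    return random.randrange(len(options))


# Hypothetical usage with a placeholder annotation file:
# results = evaluate("validation_annotations.json", random_predict)
# print(results["overall"])
```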