Perception Test: A Diagnostic Benchmark for Multimodal Video Models
May 23, 2023
Authors: Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, Andrew Zisserman, João Carreira
cs.AI
Abstract
We propose a novel multimodal video benchmark - the Perception Test - to
evaluate the perception and reasoning skills of pre-trained multimodal models
(e.g. Flamingo, BEiT-3, or GPT-4). Compared to existing benchmarks that focus
on computational tasks (e.g. classification, detection or tracking), the
Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and
types of reasoning (descriptive, explanatory, predictive, counterfactual)
across video, audio, and text modalities, to provide a comprehensive and
efficient evaluation tool. The benchmark probes pre-trained models for their
transfer capabilities, in a zero-shot / few-shot or limited finetuning regime.
For these purposes, the Perception Test introduces 11.6k real-world videos, 23s
average length, designed to show perceptually interesting situations, filmed by
around 100 participants worldwide. The videos are densely annotated with six
types of labels (multiple-choice and grounded video question-answers, object
and point tracks, temporal action and sound segments), enabling both language
and non-language evaluations. The fine-tuning and validation splits of the
benchmark are publicly available (CC-BY license), in addition to a challenge
server with a held-out test split. Human baseline results compared to
state-of-the-art video QA models show a significant gap in performance (91.4%
vs 43.6%), suggesting that there is significant room for improvement in
multimodal video understanding.
Dataset, baseline code, and challenge server are available at
https://github.com/deepmind/perception_test
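For concreteness, the sketch below shows one common way to score the multiple-choice video QA portion of such a benchmark in a zero-shot setting: each candidate answer is scored by a pre-trained model and top-1 accuracy is computed, as in the human vs. model comparison above (91.4% vs 43.6%). The record fields and the scorer interface are illustrative assumptions only and do not reflect the actual Perception Test data format or tooling in the repository.

```python
# Minimal, hypothetical sketch of zero-shot multiple-choice video QA scoring.
# The MCQuestion fields and `score_option` interface are assumptions for
# illustration, not the Perception Test API.
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class MCQuestion:
    """One multiple-choice question attached to a video."""
    video_id: str
    question: str
    options: Sequence[str]   # candidate answers
    answer_idx: int          # index of the correct option


def evaluate_mc_qa(
    questions: List[MCQuestion],
    score_option: Callable[[str, str, str], float],
) -> float:
    """Top-1 accuracy: pick the highest-scoring option per question.

    `score_option(video_id, question, option)` is any model-provided scorer,
    e.g. the log-likelihood a video-language model assigns to the option
    given the video and the question.
    """
    correct = 0
    for q in questions:
        scores = [score_option(q.video_id, q.question, opt) for opt in q.options]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == q.answer_idx)
    return correct / max(len(questions), 1)


if __name__ == "__main__":
    # Toy example with a dummy scorer; a real run would wrap a pre-trained
    # multimodal model (e.g. a Flamingo-style video-language model) here.
    dummy = [
        MCQuestion(
            video_id="vid_0001",
            question="What happens to the cup?",
            options=["it falls", "it is hidden", "nothing happens"],
            answer_idx=1,
        )
    ]
    print(evaluate_mc_qa(dummy, lambda vid, q, opt: float(len(opt))))
```

The same accuracy metric applies whether the model is probed zero-shot, few-shot, or after limited fine-tuning; only the scorer wrapped by `score_option` changes.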