Perception Test: A Diagnostic Benchmark for Multimodal Video Models
May 23, 2023
Authors: Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, Andrew Zisserman, João Carreira
cs.AI
Abstract
We propose a novel multimodal video benchmark - the Perception Test - to
evaluate the perception and reasoning skills of pre-trained multimodal models
(e.g. Flamingo, BEiT-3, or GPT-4). Compared to existing benchmarks that focus
on computational tasks (e.g. classification, detection or tracking), the
Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and
types of reasoning (descriptive, explanatory, predictive, counterfactual)
across video, audio, and text modalities, to provide a comprehensive and
efficient evaluation tool. The benchmark probes pre-trained models for their
transfer capabilities, in a zero-shot / few-shot or limited finetuning regime.
For these purposes, the Perception Test introduces 11.6k real-world videos, 23s
average length, designed to show perceptually interesting situations, filmed by
around 100 participants worldwide. The videos are densely annotated with six
types of labels (multiple-choice and grounded video question-answers, object
and point tracks, temporal action and sound segments), enabling both language
and non-language evaluations. The fine-tuning and validation splits of the
benchmark are publicly available (CC-BY license), in addition to a challenge
server with a held-out test split. Human baseline results compared to
state-of-the-art video QA models show a significant gap in performance (91.4%
vs 43.6%), suggesting that there is significant room for improvement in
multimodal video understanding.
Dataset, baseline code, and challenge server are available at
https://github.com/deepmind/perception_test
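For concreteness, the sketch below shows one common way to score the multiple-choice video QA portion of such a benchmark in a zero-shot setting: each candidate answer is scored by a pre-trained model and top-1 accuracy is computed, as in the human vs. model comparison above (91.4% vs 43.6%). The record fields and the scorer interface are illustrative assumptions only and do not reflect the actual Perception Test data format or tooling in the repository.

```python
# Minimal, hypothetical sketch of zero-shot multiple-choice video QA scoring.
# The MCQuestion fields and `score_option` interface are assumptions for
# illustration, not the Perception Test API.
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class MCQuestion:
    """One multiple-choice question attached to a video."""
    video_id: str
    question: str
    options: Sequence[str]   # candidate answers
    answer_idx: int          # index of the correct option


def evaluate_mc_qa(
    questions: List[MCQuestion],
    score_option: Callable[[str, str, str], float],
) -> float:
    """Top-1 accuracy: pick the highest-scoring option per question.

    `score_option(video_id, question, option)` is any model-provided scorer,
    e.g. the log-likelihood a video-language model assigns to the option
    given the video and the question.
    """
    correct = 0
    for q in questions:
        scores = [score_option(q.video_id, q.question, opt) for opt in q.options]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == q.answer_idx)
    return correct / max(len(questions), 1)


if __name__ == "__main__":
    # Toy example with a dummy scorer; a real run would wrap a pre-trained
    # multimodal model (e.g. a Flamingo-style video-language model) here.
    dummy = [
        MCQuestion(
            video_id="vid_0001",
            question="What happens to the cup?",
            options=["it falls", "it is hidden", "nothing happens"],
            answer_idx=1,
        )
    ]
    print(evaluate_mc_qa(dummy, lambda vid, q, opt: float(len(opt))))
```

The same accuracy metric applies whether the model is probed zero-shot, few-shot, or after limited fine-tuning; only the scorer wrapped by `score_option` changes.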