Perception Test: A Diagnostic Benchmark for Multimodal Video Models

Abstract

We propose a novel multimodal video benchmark, the Perception Test, to evaluate the perception and reasoning skills of pre-trained multimodal models (e.g. Flamingo, BEiT-3, or GPT-4). Whereas existing benchmarks focus on computational tasks (e.g. classification, detection, or tracking), the Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities, to provide a comprehensive and efficient evaluation tool. The benchmark probes pre-trained models for their transfer capabilities in a zero-shot, few-shot, or limited fine-tuning regime. For these purposes, the Perception Test introduces 11.6k real-world videos, averaging 23 s in length, designed to show perceptually interesting situations and filmed by around 100 participants worldwide. The videos are densely annotated with six types of labels (multiple-choice and grounded video question-answers, object and point tracks, temporal action and sound segments), enabling both language and non-language evaluations. The fine-tuning and validation splits of the benchmark are publicly available under a CC-BY license, and a challenge server with a held-out test split is also provided. Human baseline results compared with state-of-the-art video QA models show a large performance gap (91.4% vs 43.6%), suggesting that there is significant room for improvement in multimodal video understanding. The dataset, baseline code, and challenge server are available at https://github.com/deepmind/perception_test
