Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

Abstract

Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve as an effective medium for this learning process, facilitating a progression through these cognitive stages. However, existing video benchmarks fail to systematically evaluate the knowledge acquisition capabilities of Large Multimodal Models (LMMs). To address this gap, we introduce Video-MMMU, a multi-modal, multi-disciplinary benchmark designed to assess LMMs' ability to acquire and utilize knowledge from videos. Video-MMMU features a curated collection of 300 expert-level videos and 900 human-annotated questions across six disciplines, evaluating knowledge acquisition through stage-aligned question-answer pairs: Perception, Comprehension, and Adaptation. A proposed knowledge gain metric, Δknowledge, quantifies improvement in performance after video viewing. Evaluation of LMMs reveals a steep decline in performance as cognitive demands increase and highlights a significant gap between human and model knowledge acquisition, underscoring the need for methods to enhance LMMs' capability to learn and adapt from videos.

