MovieCORE: COgnitive REasoning in Movies

Abstract

This paper introduces MovieCORE, a novel video question answering (VQA) dataset designed to probe deeper cognitive understanding of movie content. Unlike existing datasets that focus on surface-level comprehension, MovieCORE emphasizes questions that engage System-2 thinking while remaining specific to the video material. We present an innovative agentic brainstorming approach, utilizing multiple large language models (LLMs) as thought agents to generate and refine high-quality question-answer pairs. To evaluate dataset quality, we develop a set of cognitive tests assessing depth, thought-provocation potential, and syntactic complexity. We also propose a comprehensive evaluation scheme for assessing VQA model performance on deeper cognitive tasks. To address the limitations of existing video-language models (VLMs), we introduce an agentic enhancement module, Agentic Choice Enhancement (ACE), which improves model reasoning capabilities post-training by up to 25%. Our work contributes to advancing movie understanding in AI systems and provides valuable insights into the capabilities and limitations of current VQA models when faced with more challenging, nuanced questions about cinematic content. Our project page, dataset and code can be found at https://joslefaure.github.io/assets/html/moviecore.html.
