VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

Abstract

The advancement of Large Vision Language Models (LVLMs) has significantly improved multimodal understanding, yet challenges remain in video reasoning tasks due to the scarcity of high-quality, large-scale datasets. Existing video question-answering (VideoQA) datasets often rely on costly manual annotations with insufficient granularity or automatic construction methods with redundant frame-by-frame analysis, limiting their scalability and effectiveness for complex reasoning. To address these challenges, we introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence, along with multimodal annotations of intermediate reasoning steps. Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o. We further develop video Chain-of-Thought (CoT) annotations to enrich reasoning processes, guiding GPT-4o in extracting logical relationships from QA pairs and video content. To exploit the potential of high-quality VideoQA pairs, we propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM. This framework adaptively selects core frames and performs CoT reasoning using multimodal evidence. Evaluated on our proposed benchmark with 14 tasks against 9 popular LVLMs, our method outperforms existing baselines on most tasks, demonstrating superior video reasoning capabilities. Our code and dataset will be released at: https://github.com/hshjerry/VideoEspresso
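
The abstract does not say how the semantic-aware redundancy reduction works. Purely as an illustration, and not as the authors' pipeline, the sketch below shows one common way to thin out near-duplicate frames: keep a frame only when its embedding differs enough from the last kept frame. The function names, the use of cosine similarity, the toy 2-D embeddings, and the 0.9 threshold are all assumptions made for this example.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_redundant_frames(
    embeddings: Sequence[Sequence[float]], threshold: float = 0.9
) -> List[int]:
    """Return the indices of frames to keep.

    Hypothetical rule: a frame is kept only if its similarity to the most
    recently kept frame is below `threshold` (an assumed value).
    """
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        if not kept or cosine_similarity(embeddings[kept[-1]], emb) < threshold:
            kept.append(i)
    return kept


# Toy per-frame embeddings; a real system would take these from a vision
# or captioning model.
frames = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.01, 0.99], [1.0, 0.0]]
kept_indices = drop_redundant_frames(frames)
print(kept_indices)
```

Comparing against the last kept frame, rather than the immediately preceding one, stops a slow drift across many near-identical frames from going unnoticed. The threshold trades recall of fine detail against redundancy and would need tuning on real data.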
