Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction

Abstract

Video understanding is a fundamental capability for multimodal intelligence, and recent Multimodal Large Language Models (MLLMs) have achieved remarkable performance on Video Question Answering (VideoQA) benchmarks. However, existing benchmarks primarily evaluate whether models can perceive shallow visual cues; they rarely examine whether MLLMs can learn deeper knowledge or procedural skills from video tutorials and generalize them to downstream long-horizon agentic tasks. To address this gap, we introduce VG-GUIBench (Video-Guided GUI Benchmark), a new benchmark that evaluates whether MLLM-based GUI agents can follow video tutorials to complete the corresponding GUI interaction tasks. Furthermore, we observe that model performance on both VideoQA and video-guided agentic tasks depends critically on effective keyframe extraction. Based on this observation, we propose TASKER (Task-driven And Scene-aware Keyframe searchER), a keyframe extraction algorithm that jointly considers task relevance and scene dynamics to identify informative frames. Experimental results demonstrate that TASKER achieves significant performance improvements on both VideoQA and video-guided agentic task benchmarks, outperforming the best baseline by 2.0% on the EgoSchema fullset and by 1.8% on the NExT-QA dataset. These results further highlight the potential of generalized keyframe extraction methods for video understanding tasks. Our code and data are available at https://github.com/VG-GUI-TASKER/VG-GUI-TASKER.