VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning

Abstract

Videos, with their unique temporal dimension, demand precise grounded understanding, where answers are directly linked to visual, interpretable evidence. Despite significant breakthroughs in reasoning capabilities within Large Language Models, multi-modal reasoning, especially for videos, remains unexplored. In this work, we introduce VideoMind, a novel video-language agent designed for temporal-grounded video understanding. VideoMind incorporates two key innovations: (i) We identify essential capabilities for video temporal reasoning and develop a role-based agentic workflow, comprising a planner that coordinates the other roles, a grounder for temporal localization, a verifier that assesses the accuracy of temporal intervals, and an answerer for question-answering. (ii) To integrate these diverse roles efficiently, we propose a novel Chain-of-LoRA strategy, which enables seamless role-switching via lightweight LoRA adaptors while avoiding the overhead of multiple models, thus balancing efficiency and flexibility. Extensive experiments on 14 public benchmarks demonstrate that our agent achieves state-of-the-art performance on diverse video understanding tasks, including 3 on grounded video question-answering, 6 on video temporal grounding, and 5 on general video question-answering, underscoring its effectiveness in advancing video agents and long-form temporal reasoning.
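
The role-based workflow and the role-switching idea described above can be illustrated with a toy sketch. This is not the authors' implementation: every name here (SharedBackbone, plan, ground, verify, answer, answer_question) is hypothetical, the role handlers are placeholder functions rather than LoRA adaptors, and the fixed plan is an assumption made for brevity. The sketch only shows the control flow, in which one shared object is switched between roles instead of loading a separate model per role.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


@dataclass
class SharedBackbone:
    """Hypothetical stand-in for one base model hosting several role adapters."""
    handlers: Dict[str, Callable]
    active_role: str = ""
    switch_log: List[str] = field(default_factory=list)

    def run(self, role: str, *args):
        # "Switch" to the requested role, then invoke it. In a real system
        # this step would activate a LoRA adaptor (assumption).
        self.active_role = role
        self.switch_log.append(role)
        return self.handlers[role](*args)


def plan(question: str) -> List[str]:
    # Toy planner: always ground, verify, then answer (assumed fixed plan).
    return ["grounder", "verifier", "answerer"]


def ground(question: str, duration: float) -> List[Interval]:
    # Toy grounder: propose the two halves of the video as candidates.
    return [(0.0, duration / 2), (duration / 2, duration)]


def verify(question: str, candidates: List[Interval]) -> Interval:
    # Toy verifier: placeholder scoring that prefers the latest candidate.
    return max(candidates, key=lambda iv: iv[0])


def answer(question: str, interval: Interval) -> str:
    # Toy answerer: report which interval the answer would be based on.
    return f"answer based on {interval[0]:.0f}s-{interval[1]:.0f}s"


def answer_question(question: str, duration: float):
    backbone = SharedBackbone(handlers={
        "planner": plan, "grounder": ground,
        "verifier": verify, "answerer": answer,
    })
    candidates: List[Interval] = []
    interval: Interval = (0.0, duration)
    result = ""
    # The planner decides which roles run; each step reuses the same backbone.
    for step in backbone.run("planner", question):
        if step == "grounder":
            candidates = backbone.run("grounder", question, duration)
        elif step == "verifier":
            interval = backbone.run("verifier", question, candidates)
        elif step == "answerer":
            result = backbone.run("answerer", question, interval)
    return result, interval, backbone.switch_log
```

Calling answer_question("What happens after the goal?", 60.0) walks through the four roles in order on a single backbone object, which is the property the Chain-of-LoRA strategy is described as providing.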
