MR. Video: "MapReduce" is the Principle for Long Video Understanding

Abstract

We propose MR. Video, an agentic long video understanding framework that demonstrates the simple yet effective MapReduce principle for processing long videos: (1) Map: independently and densely perceiving short video clips, and (2) Reduce: jointly aggregating information from all clips. Compared with sequence-to-sequence vision-language models (VLMs), MR. Video performs detailed short video perception without being limited by context length. Compared with existing video agents, which typically rely on sequential key segment selection, the Map operation enables simpler and more scalable sequence-parallel perception of short video segments. The Reduce step allows for more comprehensive context aggregation and reasoning, going beyond explicit key segment retrieval. This MapReduce principle is applicable to both VLMs and video agents, and we use LLM agents to validate its effectiveness. In practice, MR. Video employs two MapReduce stages: (A) Captioning: generating captions for short video clips (map), then standardizing repeated characters and objects into shared names (reduce); (B) Analysis: for each user question, analyzing relevant information from individual short videos (map), then integrating the results into a final answer (reduce). MR. Video achieves over 10% accuracy improvement on the challenging LVBench compared to state-of-the-art VLMs and video agents. Code is available at: https://github.com/ziqipang/MR-Video
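The two-stage structure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the helper names (map_reduce, caption_clip, standardize_names, make_analyzer, integrate) are hypothetical, the "clips" are plain strings, and simple string operations stand in for the VLM and LLM calls that a real system would make. It only shows the control flow: each clip is mapped independently (and can therefore run in parallel), and the reduce step sees all mapped results at once.

```python
from concurrent.futures import ThreadPoolExecutor


def map_reduce(items, map_fn, reduce_fn, max_workers=4):
    """Map: apply map_fn to every item independently. Reduce: aggregate all results jointly."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        mapped = list(pool.map(map_fn, items))  # pool.map preserves input order
    return reduce_fn(mapped)


# Toy stand-ins: each "clip" is a text description rather than video frames.
clips = [
    "a woman in a red coat enters the cafe",
    "the lady wearing red orders coffee",
    "a man in a hat leaves the cafe",
]


def caption_clip(clip):
    # Hypothetical placeholder for a per-clip captioning call (assumption).
    return clip


def standardize_names(captions):
    # Hypothetical placeholder for the reduce step that maps repeated
    # characters to one shared name; the alias table is hard-coded here.
    aliases = {
        "a woman in a red coat": "<woman_red>",
        "the lady wearing red": "<woman_red>",
    }
    standardized = []
    for caption in captions:
        for alias, name in aliases.items():
            caption = caption.replace(alias, name)
        standardized.append(caption)
    return standardized


def make_analyzer(keyword):
    # Builds a per-clip analysis function for one question (assumption:
    # relevance is a keyword match instead of an LLM judgment).
    def analyze(caption):
        return caption if keyword in caption else None
    return analyze


def integrate(findings):
    # Reduce step of the analysis stage: count the clips with relevant findings.
    return len([f for f in findings if f is not None])


# Stage A (captioning): caption each clip, then unify character names.
captions = map_reduce(clips, caption_clip, standardize_names)

# Stage B (analysis): "In how many clips does the woman in red appear?"
answer = map_reduce(captions, make_analyzer("<woman_red>"), integrate)
```

Because the map functions share no state, the number of workers can grow with the number of clips without changing the logic, while consistency across clips (here, the shared character name) is handled only in the reduce steps.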
