MR. Video: "MapReduce" is the Principle for Long Video Understanding

Abstract

We propose MR. Video, an agentic long video understanding framework that demonstrates the simple yet effective MapReduce principle for processing long videos: (1) Map: independently and densely perceiving short video clips, and (2) Reduce: jointly aggregating information from all clips. Compared with sequence-to-sequence vision-language models (VLMs), MR. Video performs detailed short video perception without being limited by context length. Compared with existing video agents, which typically rely on sequential key segment selection, the Map operation enables simpler and more scalable sequence-parallel perception of short video segments. The Reduce step allows for more comprehensive context aggregation and reasoning, going beyond explicit key segment retrieval. This MapReduce principle is applicable to both VLMs and video agents, and we use LLM agents to validate its effectiveness. In practice, MR. Video employs two MapReduce stages: (A) Captioning: generating captions for short video clips (map), then standardizing recurring characters and objects into shared names (reduce); (B) Analysis: for each user question, analyzing the relevant information in each individual short video (map), then integrating it into a final answer (reduce). On the challenging LVBench, MR. Video achieves an accuracy improvement of over 10% compared with state-of-the-art VLMs and video agents. Code is available at: https://github.com/ziqipang/MR-Video
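The two stages described in the abstract can be illustrated with a minimal sketch. It assumes toy string "clips" and hand-written stand-ins for the model calls; the names `map_reduce`, `caption_clip`, `unify_names`, `answer_question`, and `ALIASES` are hypothetical and are not taken from the MR-Video repository. A real system would call a VLM or LLM where the stubs appear.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")
A = TypeVar("A")


def map_reduce(
    items: List[T],
    map_fn: Callable[[T], R],
    reduce_fn: Callable[[List[R]], A],
    max_workers: int = 4,
) -> A:
    """Run map_fn on every item independently, then reduce over all results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, so clip order is kept.
        mapped = list(pool.map(map_fn, items))
    return reduce_fn(mapped)


# Assumption: a fixed alias table stands in for the model that would decide
# which descriptions refer to the same character or object.
ALIASES: Dict[str, str] = {
    "the chef": "Person_A",
    "a man in an apron": "Person_A",
}


def caption_clip(clip: str) -> str:
    # Stage A map (stub): describe one short clip in isolation.
    return clip.lower()


def unify_names(captions: List[str]) -> List[str]:
    # Stage A reduce (stub): look at all captions jointly and give
    # recurring entities one shared name.
    unified = []
    for caption in captions:
        for alias, name in ALIASES.items():
            caption = caption.replace(alias, name)
        unified.append(caption)
    return unified


def answer_question(captions: List[str], keyword: str) -> str:
    def analyze(caption: str) -> Optional[str]:
        # Stage B map (stub): keep a clip's caption only if it is relevant.
        return caption if keyword in caption else None

    def integrate(notes: List[Optional[str]]) -> str:
        # Stage B reduce (stub): combine per-clip evidence into one answer.
        evidence = [note for note in notes if note is not None]
        return " | ".join(evidence) if evidence else "no evidence found"

    return map_reduce(captions, analyze, integrate)


clips = [
    "A man in an apron chops onions",
    "A dog sleeps on the sofa",
    "The chef plates the dish",
]
captions = map_reduce(clips, caption_clip, unify_names)
answer = answer_question(captions, "Person_A")
print(captions)
print(answer)
```

In this sketch the map step never looks at more than one clip, so it can be parallelized freely, while each reduce step sees the results for all clips at once. That mirrors the split the abstract describes, although the stubs here are far simpler than model-based captioning or reasoning.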
