MR. Video: "MapReduce" is the Principle for Long Video Understanding

Abstract

We propose MR. Video, an agentic long video understanding framework that demonstrates the simple yet effective MapReduce principle for processing long videos: (1) Map: independently and densely perceiving short video clips, and (2) Reduce: jointly aggregating information from all clips. Compared with sequence-to-sequence vision-language models (VLMs), MR. Video performs detailed short video perception without being limited by context length. Compared with existing video agents, which typically rely on sequential key segment selection, the Map operation enables simpler and more scalable sequence-parallel perception of short video segments. The Reduce step allows for more comprehensive context aggregation and reasoning, surpassing explicit key segment retrieval. This MapReduce principle is applicable to both VLMs and video agents, and we use LLM agents to validate its effectiveness. In practice, MR. Video employs two MapReduce stages: (A) Captioning: generating captions for short video clips (map), then standardizing repeated characters and objects into shared names (reduce); (B) Analysis: for each user question, analyzing relevant information from individual short videos (map), then integrating it into a final answer (reduce). MR. Video improves accuracy by over 10% on the challenging LVBench benchmark compared with state-of-the-art VLMs and video agents. Code is available at: https://github.com/ziqipang/MR-Video
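
The two-stage structure described above can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`map_reduce`, `caption_clip`, `standardize_names`, `find_evidence`, `merge_evidence`, and the `ALIASES` table) is a hypothetical stand-in, clips are represented as plain text descriptions, and simple string operations take the place of the VLM and LLM calls a real system would make. It only shows the control flow: each clip is processed independently in the map step, and all per-clip results are combined in the reduce step.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def map_reduce(items, map_fn, reduce_fn, max_workers=4):
    """Apply map_fn to each item independently, then reduce all results jointly."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        mapped = list(pool.map(map_fn, items))
    return reduce_fn(mapped)


# Stage A (captioning) -------------------------------------------------------

def caption_clip(clip):
    # Assumption: stand-in for a VLM captioning call. In this toy example a
    # "clip" is already a text description, so it is returned unchanged.
    return clip


# Assumption: a hand-written alias table. A real system would have to infer
# which descriptions refer to the same character or object.
ALIASES = {
    "a man in a red coat": "Person_1",
    "the red-coated man": "Person_1",
}


def standardize_names(captions):
    """Reduce step: rewrite repeated entities to one shared name."""
    standardized = []
    for caption in captions:
        for alias, name in ALIASES.items():
            caption = caption.replace(alias, name)
        standardized.append(caption)
    return standardized


# Stage B (analysis) ---------------------------------------------------------

def find_evidence(subject, indexed_caption):
    """Map step: keep a clip only if it mentions the subject of the question."""
    index, caption = indexed_caption
    return (index, caption) if subject in caption else None


def merge_evidence(results):
    """Reduce step: gather the relevant clips, preserving temporal order."""
    return [r for r in results if r is not None]


if __name__ == "__main__":
    clips = [
        "a man in a red coat enters a cafe",
        "a dog sleeps on the porch",
        "the red-coated man pays and leaves",
    ]
    captions = map_reduce(clips, caption_clip, standardize_names)
    evidence = map_reduce(
        list(enumerate(captions)),
        partial(find_evidence, "Person_1"),
        merge_evidence,
    )
    for index, caption in evidence:
        print(index, caption)
```

Because the map step never looks at more than one clip at a time, it can run in parallel and its cost grows linearly with video length. Only the reduce step sees results from all clips, which is where cross-clip consistency (shared names) and cross-clip reasoning (answer integration) take place.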