WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning
December 2, 2025
Authors: Woongyeong Yeo, Kangsan Kim, Jaehong Yoon, Sung Ju Hwang
cs.AI
Abstract
Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or even days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they rely heavily on text and fail to utilize visual evidence when reasoning over complex scenes. Moreover, retrieval at fixed temporal scales further limits their flexibility in capturing events that span variable durations. To address this, we introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories encompassing both textual and visual representations. WorldMM comprises three types of memory: episodic memory, which indexes factual events across multiple temporal scales; semantic memory, which continuously updates high-level conceptual knowledge; and visual memory, which preserves detailed information about scenes. During inference, an adaptive retrieval agent iteratively selects the memory source most relevant to the query, retrieving at multiple temporal granularities until it determines that sufficient information has been gathered. WorldMM significantly outperforms existing baselines across five long video question-answering benchmarks, achieving an average 8.4% performance gain over the previous state of the art and demonstrating its effectiveness in long video reasoning.
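The abstract gives only a high-level description of the architecture, so the following is a minimal Python sketch of the control flow it outlines: several memory banks and an iterative retrieval loop with a sufficiency check. Every identifier here (`MemoryBank`, `answer_query`, the keyword-overlap relevance score, the `needed` threshold, and the sample entries) is invented for illustration and is not from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryBank:
    """Toy stand-in for one of WorldMM's memory sources
    (episodic / semantic / visual), reduced to a keyword index."""
    name: str
    entries: dict[str, str] = field(default_factory=dict)

    def retrieve(self, query: str) -> list[str]:
        # Toy relevance test: return entries whose key shares a word
        # with the query (the paper's actual retrieval is far richer).
        words = set(query.lower().split())
        return [text for key, text in self.entries.items()
                if words & set(key.lower().split())]


def answer_query(query: str, memories: list[MemoryBank],
                 needed: int = 2) -> list[str]:
    """Iteratively consult the most promising remaining memory source
    until 'enough' evidence is gathered -- a stand-in for the adaptive
    retrieval agent and its sufficiency check described in the abstract."""
    evidence: list[str] = []
    remaining = list(memories)
    while remaining and len(evidence) < needed:
        # Select the source whose index overlaps the query the most.
        best = max(remaining, key=lambda m: len(m.retrieve(query)))
        remaining.remove(best)
        evidence.extend(best.retrieve(query))
    return evidence


if __name__ == "__main__":
    episodic = MemoryBank("episodic", {
        "day 1 morning kitchen": "subject prepares coffee at 08:10",
        "day 1 evening office": "subject reviews slides at 19:40",
    })
    semantic = MemoryBank("semantic", {
        "daily routine coffee": "the subject drinks coffee every morning",
    })
    visual = MemoryBank("visual", {
        "kitchen counter keyframes": "keyframes showing the kitchen counter",
    })
    print(answer_query("what happens in the kitchen every morning",
                       [episodic, semantic, visual]))
```

The loop mirrors the described behavior of selecting one memory source per step and stopping once the gathered evidence is judged sufficient; in the actual system, both the source selection and the stopping decision are made by the retrieval agent rather than by these fixed heuristics.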