MemDreamer: Decoupling Perception and Reasoning for Long-Video Understanding via Hierarchical Graph Memory and Agentic Retrieval

Abstract

Current Vision-Language Models (VLMs) struggle with hours-long videos because processing full-length visual sequences causes prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer, which decouples perception from reasoning and recasts long-video understanding as an agentic exploration process. As a plug-and-play framework, it incrementally streams a video to construct a Hierarchical Graph Memory: a top-down, three-tier architecture for semantic abstraction, anchored by a foundational graph that captures spatiotemporal and causal relations. During inference, the reasoning model performs agentic, tool-augmented retrieval, navigating the hierarchy, searching nodes, and traversing logical edges through an Observation-Reason-Action loop. Experiments show that MemDreamer achieves state-of-the-art results on four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5-point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM's performance on logic-reasoning benchmarks and its performance on long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.
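
To make the retrieval idea concrete, the following is a minimal Python sketch of a tiered graph memory queried through an Observation-Reason-Action loop. It is an illustration under stated assumptions, not the authors' implementation: the class and function names (`Node`, `HierarchicalGraphMemory`, `run_loop`, `toy_policy`), the three tool names (`search`, `expand`, `traverse`), the tier numbering, and the `caused_by` relation label are all hypothetical, and a scripted policy stands in for the reasoning model.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    tier: int      # assumed numbering: 0 = most abstract, 2 = most detailed
    summary: str
    children: list[str] = field(default_factory=list)
    edges: dict[str, list[str]] = field(default_factory=dict)  # relation -> target ids


class HierarchicalGraphMemory:
    """Hypothetical memory store exposing three retrieval tools."""

    def __init__(self):
        self.nodes: dict[str, Node] = {}

    def add(self, node, parent_id=None):
        self.nodes[node.node_id] = node
        if parent_id is not None:
            self.nodes[parent_id].children.append(node.node_id)

    def link(self, src, relation, dst):
        self.nodes[src].edges.setdefault(relation, []).append(dst)

    def search(self, keyword):
        # Tool 1: case-insensitive keyword match over node summaries.
        key = keyword.lower()
        return [n.node_id for n in self.nodes.values() if key in n.summary.lower()]

    def expand(self, node_id):
        # Tool 2: move one tier down the hierarchy.
        return list(self.nodes[node_id].children)

    def traverse(self, node_id, relation):
        # Tool 3: follow a logical edge (e.g. a causal relation).
        return list(self.nodes[node_id].edges.get(relation, []))


def run_loop(memory, policy, max_steps=8):
    """Observation-Reason-Action loop; `policy` stands in for the reasoning model."""
    observation, trace = "start", []
    for _ in range(max_steps):
        action, arg = policy(observation, trace)       # Reason
        if action == "answer":
            return arg, trace
        if action == "search":                          # Action
            result = memory.search(arg)
        elif action == "expand":
            result = memory.expand(arg)
        elif action == "traverse":
            result = memory.traverse(*arg)
        else:
            raise ValueError(f"unknown action: {action}")
        observation = [memory.nodes[i].summary for i in result]  # Observation
        trace.append((action, arg, result))
    return None, trace


def toy_policy(observation, trace):
    # Scripted stand-in for a model: find the spill, then ask what caused it.
    if not trace:
        return "search", "spill"
    last_action, _, result = trace[-1]
    if last_action == "search" and result:
        return "traverse", (result[0], "caused_by")
    return "answer", (observation[0] if observation else None)


def build_demo_memory():
    m = HierarchicalGraphMemory()
    m.add(Node("video", 0, "Cooking session"))
    m.add(Node("scene1", 1, "Preparing ingredients"), parent_id="video")
    m.add(Node("ev1", 2, "Person bumps the table"), parent_id="scene1")
    m.add(Node("ev2", 2, "Milk spill on the floor"), parent_id="scene1")
    m.link("ev2", "caused_by", "ev1")
    return m


if __name__ == "__main__":
    answer, trace = run_loop(build_demo_memory(), toy_policy)
    print(answer)
    for step in trace:
        print(step)
```

In this sketch the policy only ever sees the short summaries returned by each tool call rather than the full video, which is the intuition behind keeping the reasoning context small; the actual context budget and tool set used by MemDreamer are not specified in the abstract.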