MemDreamer: Decoupling Perception and Reasoning for Long-Video Understanding via Hierarchical Graph Memory and Agentic Retrieval

Abstract

Current Vision-Language Models (VLMs) struggle with hours-long videos because processing full-length visual sequences causes prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer, which decouples perception from reasoning and recasts long-video understanding as an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a Hierarchical Graph Memory: a top-down, three-tier architecture for semantic abstraction, anchored by a foundational graph that captures spatiotemporal and causal relations. During inference, the reasoning model employs agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show that MemDreamer achieves state-of-the-art results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5-point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM's performance on logic reasoning benchmarks and its performance on long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.
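
To make the retrieval idea concrete, the following is a minimal illustrative sketch of a three-tier memory with graph edges and an Observation-Reason-Action loop over it. It is not MemDreamer's implementation: the class and function names (`Node`, `HierarchicalGraphMemory`, `run_agent`, `scripted_policy`), the tool set (`expand`, `search`, `traverse`, `read`), the tier numbering, the `caused_by` relation label, and the toy cooking data are all assumptions made for illustration. A hard-coded policy stands in for the reasoning model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    node_id: str
    tier: int   # assumed numbering: 0 = video summary, 1 = segment, 2 = event graph
    text: str
    children: List[str] = field(default_factory=list)          # links to the tier below
    edges: Dict[str, List[str]] = field(default_factory=dict)  # relation -> target ids


class HierarchicalGraphMemory:
    """Toy memory: a tree of summaries whose lowest tier also carries graph edges."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}

    def add(self, node: Node, parent: str = None) -> None:
        self.nodes[node.node_id] = node
        if parent is not None:
            self.nodes[parent].children.append(node.node_id)

    def link(self, src: str, relation: str, dst: str) -> None:
        self.nodes[src].edges.setdefault(relation, []).append(dst)

    # Hypothetical tools exposed to the reasoning model.
    def expand(self, node_id: str) -> List[str]:
        return list(self.nodes[node_id].children)

    def search(self, keyword: str, tier: int) -> List[str]:
        key = keyword.lower()
        return [n.node_id for n in self.nodes.values()
                if n.tier == tier and key in n.text.lower()]

    def traverse(self, node_id: str, relation: str) -> List[str]:
        return list(self.nodes[node_id].edges.get(relation, []))

    def read(self, node_id: str) -> str:
        return self.nodes[node_id].text


def run_agent(memory, policy, question, max_steps=8):
    """Observe -> reason (policy picks a tool) -> act (call the tool), until an answer."""
    tools = {"expand": memory.expand, "search": memory.search,
             "traverse": memory.traverse, "read": memory.read}
    observation, trace = None, []
    for _ in range(max_steps):
        action, args = policy(question, observation, trace)
        if action == "answer":
            return args["text"], trace
        observation = tools[action](**args)
        trace.append((action, args, observation))
    return None, trace  # step budget exhausted without an answer


def scripted_policy(question, observation, trace):
    """Stand-in for the reasoning model: a fixed plan for one toy question."""
    step = len(trace)
    if step == 0:
        return "search", {"keyword": "cries", "tier": 2}
    if step == 1:
        return "traverse", {"node_id": observation[0], "relation": "caused_by"}
    if step == 2:
        return "read", {"node_id": observation[0]}
    return "answer", {"text": observation}


def build_toy_memory() -> HierarchicalGraphMemory:
    memory = HierarchicalGraphMemory()
    memory.add(Node("v", 0, "A cooking video"))
    memory.add(Node("s1", 1, "Kitchen scene: preparing soup"), parent="v")
    memory.add(Node("e1", 2, "Chef chops onions"), parent="s1")
    memory.add(Node("e2", 2, "Chef cries"), parent="s1")
    memory.link("e2", "caused_by", "e1")  # causal edge in the lowest tier
    return memory


answer, trace = run_agent(build_toy_memory(), scripted_policy, "Why does the chef cry?")
print(answer)  # Chef chops onions
```

In this sketch the policy only ever sees short tool outputs (node ids and one node's text) rather than the whole memory, which is the intuition behind keeping the reasoning context small; the 2% figure reported above comes from the paper's experiments, not from this toy.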