MemDreamer: Decoupling Perception and Reasoning for Long-Video Understanding via Hierarchical Graph Memory and an Agentic Retrieval Mechanism

Abstract

Current Vision-Language Models (VLMs) struggle with hours-long videos because processing full-length visual sequences causes prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer, which decouples perception from reasoning and recasts long-video understanding as an agentic exploration process. As a plug-and-play framework, MemDreamer incrementally streams a video to construct a Hierarchical Graph Memory: a top-down, three-tier architecture for semantic abstraction, anchored by a foundational graph that captures spatiotemporal and causal relations. During inference, the reasoning model performs agentic, tool-augmented retrieval, navigating the hierarchy, searching nodes, and traversing logical edges in an Observation-Reason-Action loop. Experiments show that MemDreamer achieves state-of-the-art results on four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of that required for full-context ingestion while delivering a 12.5-point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM's performance on logic-reasoning benchmarks and its performance on long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.
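
To make the retrieval idea concrete, the following is a minimal sketch of a tiered graph memory queried through three tools (navigate the hierarchy, search nodes, traverse edges) inside an Observation-Reason-Action loop. It is an illustration under stated assumptions, not MemDreamer's implementation: the class and function names, the tier numbering, the `caused_by` relation, the toy cooking-video content, and the scripted policy standing in for the reasoning model are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    tier: int   # assumed numbering: 0 = top summary, 1 = mid level, 2 = base graph
    text: str
    children: list = field(default_factory=list)  # ids of nodes one tier down
    edges: dict = field(default_factory=dict)     # relation name -> list of node ids


class HierarchicalGraphMemory:
    """Toy three-tier memory; the tool names below are hypothetical."""

    def __init__(self):
        self.nodes = {}

    def add(self, node, parent=None):
        self.nodes[node.node_id] = node
        if parent is not None:
            self.nodes[parent].children.append(node.node_id)

    def link(self, src, relation, dst):
        self.nodes[src].edges.setdefault(relation, []).append(dst)

    # Tool 1: move down the hierarchy from a node to its children.
    def navigate(self, node_id):
        return list(self.nodes[node_id].children)

    # Tool 2: case-insensitive keyword search over node descriptions.
    def search(self, keyword):
        kw = keyword.lower()
        return [n.node_id for n in self.nodes.values() if kw in n.text.lower()]

    # Tool 3: follow a logical edge (for example a causal relation).
    def traverse(self, node_id, relation):
        return list(self.nodes[node_id].edges.get(relation, []))


def run_loop(memory, policy, max_steps=8):
    """Observation-Reason-Action loop with a bounded number of steps."""
    observation = None
    trace = []
    for _ in range(max_steps):
        action, args = policy(observation)            # Reason: pick the next action
        if action == "answer":
            return args[0], trace
        observation = getattr(memory, action)(*args)  # Act, then observe the result
        trace.append((action, args, observation))
    return None, trace


def build_demo_memory():
    mem = HierarchicalGraphMemory()
    mem.add(Node("video", 0, "Cooking video overview"))
    mem.add(Node("seg1", 1, "Chef prepares ingredients"), parent="video")
    mem.add(Node("seg2", 1, "Chef cooks the soup"), parent="video")
    mem.add(Node("e1", 2, "Chef chops onions"), parent="seg1")
    mem.add(Node("e2", 2, "Chef cries while cutting"), parent="seg1")
    mem.link("e2", "caused_by", "e1")
    return mem


def make_scripted_policy(memory):
    """Stand-in for the reasoning model: a fixed plan for one question."""
    state = {"step": 0}

    def policy(observation):
        state["step"] += 1
        if state["step"] == 1:
            return "search", ("cries",)
        if state["step"] == 2:
            return "traverse", (observation[0], "caused_by")
        return "answer", (memory.nodes[observation[0]].text,)

    return policy


memory = build_demo_memory()
answer, trace = run_loop(memory, make_scripted_policy(memory))
print(answer)
```

In this sketch the loop only ever sees the short lists returned by each tool call rather than the full memory, which mirrors, in a very simplified form, how restricting the reasoning model to retrieved nodes can keep its context small.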