MemDreamer: Decoupling Perception and Reasoning for Long-Video Understanding via Hierarchical Graph Memory and an Agentic Retrieval Mechanism

Abstract

Current Vision-Language Models (VLMs) struggle with hours-long videos because processing the full visual sequence causes prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer, which decouples perception from reasoning and recasts long-video understanding as an agentic exploration process. As a plug-and-play framework, it streams the video incrementally to construct a Hierarchical Graph Memory: a bottom-up, three-tier architecture for semantic abstraction, anchored by a foundational graph that captures spatiotemporal and causal relations. During inference, the reasoning model employs agentic, tool-augmented retrieval, navigating the hierarchy, searching nodes, and traversing logical edges in an Observation-Reason-Action loop. Experiments show that MemDreamer achieves state-of-the-art results on four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5-point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM's performance on logic-reasoning benchmarks and its performance on long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.
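To make the retrieval idea concrete, the following is a minimal sketch of how a three-tier graph memory and an Observation-Reason-Action loop might fit together. It is not the authors' implementation. The class and function names (`Node`, `GraphMemory`, `search`, `descend`, `follow`, `retrieve`), the relation label `caused_by`, and the toy cooking video are all assumptions made for illustration, and a scripted policy stands in for the reasoning model.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical memory node; tier 0 is the foundational graph, higher tiers are more abstract."""
    node_id: str
    tier: int
    summary: str
    children: list[str] = field(default_factory=list)          # links to the tier below
    edges: dict[str, list[str]] = field(default_factory=dict)  # relation name -> node ids


class GraphMemory:
    """Hypothetical store exposing three tools to the reasoning model."""

    def __init__(self) -> None:
        self.nodes: dict[str, Node] = {}

    def add(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def search(self, keyword: str, tier: int) -> list[str]:
        """Keyword match over summaries within one tier."""
        kw = keyword.lower()
        return [n.node_id for n in self.nodes.values()
                if n.tier == tier and kw in n.summary.lower()]

    def descend(self, node_id: str) -> list[str]:
        """Move one tier down the hierarchy."""
        return list(self.nodes[node_id].children)

    def follow(self, node_id: str, relation: str) -> list[str]:
        """Traverse a logical edge such as a causal link."""
        return list(self.nodes[node_id].edges.get(relation, []))


def retrieve(memory: GraphMemory, policy, max_steps: int = 8):
    """Observation-Reason-Action loop: the policy reasons over the last observation and picks a tool."""
    tools = {"search": memory.search, "descend": memory.descend, "follow": memory.follow}
    observation: list[tuple[str, str]] = []  # (node id, summary) pairs from the last action
    visited: list[str] = []
    for _ in range(max_steps):
        action, args = policy(observation)               # reason
        if action == "answer":
            return args["text"], visited
        ids = tools[action](**args)                      # act
        visited.extend(ids)
        observation = [(i, memory.nodes[i].summary) for i in ids]  # observe
    return None, visited


def build_toy_memory() -> GraphMemory:
    """Invented example data: one episode, one segment, two causally linked events."""
    m = GraphMemory()
    m.add(Node("ep1", 2, "Cooking dinner in the kitchen", children=["seg1"]))
    m.add(Node("seg1", 1, "Chopping vegetables", children=["ev1", "ev2"]))
    m.add(Node("ev1", 0, "Knife slips while chopping", edges={"causes": ["ev2"]}))
    m.add(Node("ev2", 0, "Person bandages a finger", edges={"caused_by": ["ev1"]}))
    return m


def make_scripted_policy():
    """Fixed plan standing in for a real reasoning model answering 'why the bandage?'."""
    step = 0

    def policy(observation):
        nonlocal step
        step += 1
        if step == 1:
            return "search", {"keyword": "kitchen", "tier": 2}
        if step in (2, 3):
            return "descend", {"node_id": observation[0][0]}
        if step == 4:
            target = next(i for i, s in observation if "bandages" in s)
            return "follow", {"node_id": target, "relation": "caused_by"}
        return "answer", {"text": observation[0][1]}

    return policy


answer, visited = retrieve(build_toy_memory(), make_scripted_policy())
print(answer, visited)
```

In this sketch the policy only ever sees the summaries returned by its last tool call, never the whole memory, which is one way the reasoning context could stay small relative to full-context ingestion. How the actual system selects tools and builds its tiers is not specified in the abstract.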