MBench: A Comprehensive Benchmark for the Memory Capability of Video World Models

Abstract

Recent advances in video-based world models have demonstrated an unprecedented ability to synthesize high-fidelity visual sequences. However, a fundamental gap persists between visually plausible video generation and the functional requirements of a world model, particularly the need to maintain a stable and reasonable internal state over extended temporal horizons. Existing benchmarks primarily emphasize visual quality, motion coherence, and text-video alignment, and they largely overlook memory, the core capability that lets a world model preserve consistency across long horizons and complex interactions. To address this gap, we present MBench, a comprehensive benchmark dedicated to quantifying and evaluating the memory capability of video world models. We systematically decompose this capability into three hierarchical and complementary core dimensions: entity consistency, environment consistency, and causal consistency. These are further refined into 12 quantifiable sub-dimensions that together characterize long-term memory comprehensively. The benchmark is built on rigorously curated, real-captured long videos and is evaluated with rule-based quantitative metrics and a VLM, enabling objective and comprehensive consistency assessment. Extensive evaluations of mainstream state-of-the-art video world models reveal critical, systemic limitations of existing methods in long-term state retention. MBench thus offers a standardized benchmark and a clear research direction for advancing the field.