MBench: A Comprehensive Benchmark on Memory Capability for Video World Models
June 8, 2026
Authors: Shengjun Zhang, Zhang Zhang, Simin Huang, Zhenyu Tang, Hanyang Wang, Chensheng Dai, Min Chen, Yifan Li, Yuxin Li, Yingjie Chen, Hao Liu, Chen Li, Jing Lyu, Yueqi Duan
cs.AI
Abstract
Recent advances in video-based world models have demonstrated an unprecedented ability to synthesize high-fidelity visual sequences. However, a fundamental gap persists between visually plausible video generation and the functional requirements of a world model, particularly the need to maintain a stable and reasonable internal state over extended temporal horizons. Existing benchmarks primarily emphasize visual quality, motion coherence, and text-video alignment, and largely overlook memory: the core capability of a world model to preserve consistency across long time horizons and complex interactions. To address this gap, we present MBench, a comprehensive benchmark dedicated to quantitatively evaluating the memory capability of video world models. We systematically decompose this capability into three hierarchical and complementary core dimensions (entity consistency, environment consistency, and causal consistency), which are further refined into 12 quantifiable sub-dimensions to characterize long-term memory comprehensively. The benchmark is built on rigorously curated long videos captured in the real world, and it is evaluated with rule-based quantitative metrics and vision-language models (VLMs) to enable objective and comprehensive consistency assessment. Extensive evaluations of mainstream state-of-the-art video world models reveal critical systemic limitations of existing methods in long-term state retention. MBench thus provides a standardized benchmark and a clear research direction for advancing the field.