WorldMark: A Unified Benchmark Suite for Interactive Video World Models
April 23, 2026
Authors: Xiaojie Xu, Zhengyuan Lin, Kang He, Yukang Feng, Xiaofeng Mao, Yuanyang Yin, Kaipeng Zhang, Yongtao Ge
cs.AI
Abstract
Interactive video generation models such as Genie, YUME, HY-World, and Matrix-Game are advancing rapidly, yet every model is evaluated on its own benchmark with private scenes and trajectories, making fair cross-model comparison impossible. Existing public benchmarks offer useful metrics such as trajectory error, aesthetic scores, and VLM-based judgments, but none supplies the standardized test conditions -- identical scenes, identical action sequences, and a unified control interface -- needed to make those metrics comparable across models with heterogeneous inputs. We introduce WorldMark, the first benchmark that provides such a common playing field for interactive Image-to-Video world models. WorldMark contributes: (1) a unified action-mapping layer that translates a shared WASD-style action vocabulary into each model's native control format, enabling apples-to-apples comparison across six major models on identical scenes and trajectories; (2) a hierarchical test suite of 500 evaluation cases covering first- and third-person viewpoints, photorealistic and stylized scenes, and three difficulty tiers from Easy to Hard spanning 20-60s; and (3) a modular evaluation toolkit for Visual Quality, Control Alignment, and World Consistency, designed so that researchers can reuse our standardized inputs while plugging in their own metrics as the field evolves. We will release all data, evaluation code, and model outputs to facilitate future research. Beyond offline metrics, we launch World Model Arena (warena.ai), an online platform where anyone can pit leading world models against each other in side-by-side battles and watch the live leaderboard.
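The unified action-mapping layer (contribution 1) can be illustrated with a minimal sketch: a shared WASD-style action vocabulary is defined once, and per-model adapters translate each action into that model's native control format, so one trajectory can drive every model. All class names, field names, and control formats below are hypothetical placeholders, not WorldMark's actual schema.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass(frozen=True)
class Action:
    """One step of the shared WASD-style vocabulary (illustrative schema)."""
    move: str          # "W", "A", "S", "D", or "" for no movement
    yaw: float = 0.0   # camera rotation in degrees, + = right
    pitch: float = 0.0 # camera rotation in degrees, + = up

class ActionAdapter(Protocol):
    """Translates a shared action into a model's native control format."""
    def translate(self, action: Action) -> dict: ...

class KeyboardMouseAdapter:
    """Hypothetical adapter for a model driven by key presses + mouse deltas."""
    DEG_TO_PIXELS = 10.0  # assumed mouse-sensitivity constant

    def translate(self, action: Action) -> dict:
        return {
            "keys": [action.move] if action.move else [],
            "mouse_dx": action.yaw * self.DEG_TO_PIXELS,
            "mouse_dy": -action.pitch * self.DEG_TO_PIXELS,
        }

class ContinuousControlAdapter:
    """Hypothetical adapter for a model driven by velocity/rotation vectors."""
    _DIRS = {"W": (0.0, 1.0), "S": (0.0, -1.0),
             "A": (-1.0, 0.0), "D": (1.0, 0.0)}

    def translate(self, action: Action) -> dict:
        vx, vz = self._DIRS.get(action.move, (0.0, 0.0))
        return {"velocity": (vx, vz), "rotation": (action.yaw, action.pitch)}

def run_trajectory(adapter: ActionAdapter, trajectory: list[Action]) -> list[dict]:
    """Feed one shared trajectory to any model through its adapter."""
    return [adapter.translate(a) for a in trajectory]

# The same benchmark trajectory drives both (mock) model interfaces.
trajectory = [Action("W"), Action("W", yaw=15.0), Action("D")]
kb_controls = run_trajectory(KeyboardMouseAdapter(), trajectory)
cc_controls = run_trajectory(ContinuousControlAdapter(), trajectory)
```

Because every model consumes the same `trajectory`, downstream metrics (trajectory error, control alignment) are computed against identical inputs, which is what makes the cross-model comparison apples-to-apples.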