

WorldMark: A Unified Benchmark Suite for Interactive Video World Models

April 23, 2026
作者: Xiaojie Xu, Zhengyuan Lin, Kang He, Yukang Feng, Xiaofeng Mao, Yuanyang Yin, Kaipeng Zhang, Yongtao Ge
cs.AI

Abstract

Interactive video generation models such as Genie, YUME, HY-World, and Matrix-Game are advancing rapidly, yet every model is evaluated on its own benchmark with private scenes and trajectories, making fair cross-model comparison impossible. Existing public benchmarks offer useful metrics such as trajectory error, aesthetic scores, and VLM-based judgments, but none supplies the standardized test conditions -- identical scenes, identical action sequences, and a unified control interface -- needed to make those metrics comparable across models with heterogeneous inputs. We introduce WorldMark, the first benchmark that provides such a common playing field for interactive Image-to-Video world models. WorldMark contributes: (1) a unified action-mapping layer that translates a shared WASD-style action vocabulary into each model's native control format, enabling apples-to-apples comparison across six major models on identical scenes and trajectories; (2) a hierarchical test suite of 500 evaluation cases covering first- and third-person viewpoints, photorealistic and stylized scenes, and three difficulty tiers from Easy to Hard spanning 20-60s; and (3) a modular evaluation toolkit for Visual Quality, Control Alignment, and World Consistency, designed so that researchers can reuse our standardized inputs while plugging in their own metrics as the field evolves. We will release all data, evaluation code, and model outputs to facilitate future research. Beyond offline metrics, we launch World Model Arena (warena.ai), an online platform where anyone can pit leading world models against each other in side-by-side battles and watch the live leaderboard.
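The unified action-mapping layer in contribution (1) can be pictured as a thin per-model translation table over a shared action vocabulary. The sketch below is illustrative only: the model names used as keys, the native control formats, and the function `translate_trajectory` are assumptions for exposition, not WorldMark's actual API.

```python
# Hypothetical sketch of a unified action-mapping layer: one shared
# WASD-style vocabulary, translated into each model's native controls
# so that every model replays the exact same trajectory.

# Shared action vocabulary (assumed; the real vocabulary is not
# specified in the abstract).
SHARED_ACTIONS = ["W", "A", "S", "D", "CAM_LEFT", "CAM_RIGHT"]

# Per-model translation tables. All entries here are made up: one model
# might expect string commands, another discrete action indices.
MODEL_ACTION_MAPS = {
    "matrix-game": {
        "W": "move_forward", "A": "strafe_left",
        "S": "move_back", "D": "strafe_right",
        "CAM_LEFT": "yaw:-15", "CAM_RIGHT": "yaw:+15",
    },
    "yume": {"W": 0, "A": 1, "S": 2, "D": 3,
             "CAM_LEFT": 4, "CAM_RIGHT": 5},
}

def translate_trajectory(model_name, shared_trajectory):
    """Map a shared action sequence into one model's native control
    format, keeping the underlying trajectory identical across models."""
    table = MODEL_ACTION_MAPS[model_name]
    return [table[action] for action in shared_trajectory]

trajectory = ["W", "W", "CAM_LEFT", "D"]
print(translate_trajectory("yume", trajectory))         # [0, 0, 4, 3]
print(translate_trajectory("matrix-game", trajectory))
```

Because every model receives a translation of the same shared sequence, metric differences can be attributed to the models rather than to differing inputs, which is the "apples-to-apples" condition the benchmark is built around.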