EWMBench: Evaluating Scene, Motion, and Semantic Quality in Embodied World Models
May 14, 2025
Authors: Hu Yue, Siyuan Huang, Yue Liao, Shengcong Chen, Pengfei Zhou, Liliang Chen, Maoqing Yao, Guanghui Ren
cs.AI
Abstract
Recent advances in creative AI have enabled the synthesis of high-fidelity
images and videos conditioned on language instructions. Building on these
developments, text-to-video diffusion models have evolved into embodied world
models (EWMs) capable of generating physically plausible scenes from language
commands, effectively bridging vision and action in embodied AI applications.
This work addresses the critical challenge of evaluating EWMs beyond general
perceptual metrics to ensure the generation of physically grounded and
action-consistent behaviors. We propose the Embodied World Model Benchmark
(EWMBench), a dedicated framework designed to evaluate EWMs based on three key
aspects: visual scene consistency, motion correctness, and semantic alignment.
Our approach leverages a meticulously curated dataset encompassing diverse
scenes and motion patterns, alongside a comprehensive multi-dimensional
evaluation toolkit, to assess and compare candidate models. The proposed
benchmark not only identifies the limitations of existing video generation
models in meeting the unique requirements of embodied tasks but also provides
valuable insights to guide future advancements in the field. The dataset and
evaluation tools are publicly available at
https://github.com/AgibotTech/EWMBench.