EWMBench: Evaluating Scene, Motion, and Semantic Quality in Embodied World Models
May 14, 2025
Authors: Hu Yue, Siyuan Huang, Yue Liao, Shengcong Chen, Pengfei Zhou, Liliang Chen, Maoqing Yao, Guanghui Ren
cs.AI
Abstract
Recent advances in creative AI have enabled the synthesis of high-fidelity
images and videos conditioned on language instructions. Building on these
developments, text-to-video diffusion models have evolved into embodied world
models (EWMs) capable of generating physically plausible scenes from language
commands, effectively bridging vision and action in embodied AI applications.
This work addresses the critical challenge of evaluating EWMs beyond general
perceptual metrics to ensure the generation of physically grounded and
action-consistent behaviors. We propose the Embodied World Model Benchmark
(EWMBench), a dedicated framework designed to evaluate EWMs based on three key
aspects: visual scene consistency, motion correctness, and semantic alignment.
Our approach leverages a meticulously curated dataset encompassing diverse
scenes and motion patterns, alongside a comprehensive multi-dimensional
evaluation toolkit, to assess and compare candidate models. The proposed
benchmark not only identifies the limitations of existing video generation
models in meeting the unique requirements of embodied tasks but also provides
valuable insights to guide future advancements in the field. The dataset and
evaluation tools are publicly available at
https://github.com/AgibotTech/EWMBench.
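To make the three evaluation dimensions concrete, below is a minimal, hypothetical sketch of how per-video scores for scene consistency, motion correctness, and semantic alignment might be aggregated into model-level results for comparison. The dataclass and function names are illustrative assumptions, not the actual EWMBench API; see the repository above for the real toolkit.

```python
# Hypothetical aggregation sketch for a three-dimensional benchmark like
# EWMBench. All names and signatures here are illustrative assumptions,
# not the actual EWMBench API.
from dataclasses import dataclass
from statistics import mean

@dataclass
class DimensionScores:
    scene_consistency: float   # visual stability of the scene across frames
    motion_correctness: float  # accuracy/plausibility of the generated motion
    semantic_alignment: float  # agreement between language command and rollout

def aggregate(per_video: list[DimensionScores]) -> dict[str, float]:
    """Average each dimension over the benchmark set to compare models."""
    return {
        "scene_consistency": mean(s.scene_consistency for s in per_video),
        "motion_correctness": mean(s.motion_correctness for s in per_video),
        "semantic_alignment": mean(s.semantic_alignment for s in per_video),
    }

if __name__ == "__main__":
    # Placeholder scores standing in for real per-video metric computations.
    scores = [DimensionScores(0.91, 0.78, 0.85),
              DimensionScores(0.88, 0.81, 0.80)]
    print(aggregate(scores))
```

Reporting each dimension separately, rather than a single blended score, is what lets the benchmark expose where a given video generation model falls short of the requirements of embodied tasks.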