
Rethinking Video Generation Model for the Embodied World

January 21, 2026
Authors: Yufan Deng, Zilin Pan, Hongyu Zhang, Xiaojie Li, Ruoqing Hu, Yufei Ding, Yiming Zou, Yan Zeng, Daquan Zhou
cs.AI

Abstract

Video generation models have significantly advanced embodied intelligence, unlocking new possibilities for generating diverse robot data that capture perception, reasoning, and action in the physical world. However, synthesizing high-quality videos that accurately reflect real-world robotic interactions remains challenging, and the lack of a standardized benchmark limits fair comparisons and progress. To address this gap, we introduce a comprehensive robotics benchmark, RBench, designed to evaluate robot-oriented video generation across five task domains and four distinct embodiments. It assesses both task-level correctness and visual fidelity through reproducible sub-metrics, including structural consistency, physical plausibility, and action completeness. Evaluation of 25 representative models highlights significant deficiencies in generating physically realistic robot behaviors. Furthermore, the benchmark achieves a Spearman correlation coefficient of 0.96 with human evaluations, validating its effectiveness. While RBench provides the necessary lens to identify these deficiencies, achieving physical realism requires moving beyond evaluation to address the critical shortage of high-quality training data. Driven by these insights, we introduce a refined four-stage data pipeline, resulting in RoVid-X, the largest open-source robotic dataset for video generation with 4 million annotated video clips, covering thousands of tasks and enriched with comprehensive physical property annotations. Collectively, this synergistic ecosystem of evaluation and data establishes a robust foundation for rigorous assessment and scalable training of video models, accelerating the evolution of embodied AI toward general intelligence.
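The abstract reports a Spearman rank correlation of 0.96 between RBench scores and human evaluations. As a minimal illustration of what that statistic measures, the sketch below computes Spearman's rho for a pair of score lists using the no-ties formula rho = 1 - 6*sum(d_i^2)/(n(n^2-1)), where d_i is the difference between the two rank assignments of item i. The score values are hypothetical, not taken from the paper, and the formula assumes no tied scores (the paper's exact ranking procedure is not specified here).

```python
def spearman(x, y):
    """Spearman rank correlation, assuming no tied values in x or y."""
    n = len(x)

    def ranks(values):
        # Rank 1 = smallest value; with no ties, each value gets a unique rank.
        order = sorted(range(n), key=lambda i: values[i])
        r = [0] * n
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical benchmark scores for five models and matching human ratings;
# one pair of models is ordered differently by the two raters.
model_scores = [0.82, 0.75, 0.61, 0.58, 0.40]
human_ratings = [4.6, 3.5, 4.1, 3.3, 2.2]

print(spearman(model_scores, human_ratings))  # 0.9: rankings agree except one swap
```

A perfect agreement in ordering yields rho = 1.0 and a fully reversed ordering yields -1.0, so a value of 0.96, as reported for RBench, indicates near-identical rankings between the automatic metric and human judges.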