Rethinking Video Generation Model for the Embodied World
January 21, 2026
Authors: Yufan Deng, Zilin Pan, Hongyu Zhang, Xiaojie Li, Ruoqing Hu, Yufei Ding, Yiming Zou, Yan Zeng, Daquan Zhou
cs.AI
Abstract
Video generation models have significantly advanced embodied intelligence, unlocking new possibilities for generating diverse robot data that capture perception, reasoning, and action in the physical world. However, synthesizing high-quality videos that accurately reflect real-world robotic interactions remains challenging, and the lack of a standardized benchmark limits fair comparisons and progress. To address this gap, we introduce a comprehensive robotics benchmark, RBench, designed to evaluate robot-oriented video generation across five task domains and four distinct embodiments. It assesses both task-level correctness and visual fidelity through reproducible sub-metrics, including structural consistency, physical plausibility, and action completeness. Evaluation of 25 representative models highlights significant deficiencies in generating physically realistic robot behaviors. Furthermore, the benchmark achieves a Spearman correlation coefficient of 0.96 with human evaluations, validating its effectiveness. While RBench provides the necessary lens to identify these deficiencies, achieving physical realism requires moving beyond evaluation to address the critical shortage of high-quality training data. Driven by these insights, we introduce a refined four-stage data pipeline, resulting in RoVid-X, the largest open-source robotic dataset for video generation with 4 million annotated video clips, covering thousands of tasks and enriched with comprehensive physical property annotations. Collectively, this synergistic ecosystem of evaluation and data establishes a robust foundation for rigorous assessment and scalable training of video models, accelerating the evolution of embodied AI toward general intelligence.