Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?
June 4, 2026
Authors: Rui Zhao, Kaiming Yang, Jifeng Zhu, Siyang Chen, Ziqi Wang, Weijia Wu, Kevin Qinghong Lin, Heng Wang, Mike Zheng Shou
cs.AI
Abstract
Video generation models have made impressive strides in synthesizing visually compelling content, yet their outputs remain confined to the virtual domain. A natural question follows: how well do these models reflect the physical world when their generated videos leave the screen and enter reality? We propose robotic manipulation as a concrete, measurable window onto this question: if a model has truly internalized physical laws, the motion it depicts should translate into executable robot behavior. We introduce Dream.exe, an evaluation framework that operationalizes this criterion through a video-to-execution pipeline. Given a scene image and a task description, Dream.exe synthesizes a manipulation video, converts the generated motion into robot trajectories, and executes them in a physics simulator, yielding a grounding signal that purely visual metrics cannot offer. Using this pipeline, we evaluate 8 models spanning frontier closed-source generators, open-source generators, and robot-specific models. Our benchmark covers 101 manually curated manipulation tasks at three levels of physical complexity, measured across visual quality, trajectory fidelity, and execution success. Encouragingly, several models achieve measurable execution success, suggesting that generative priors learned from internet-scale data already encode meaningful physical knowledge. Yet visual quality proves a poor predictor of executability, exposing a dimension of model capability that standard visual evaluations do not capture. Dream.exe will be open-sourced at https://github.com/showlab/Dream.exe.
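The video-to-execution pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the Dream.exe implementation: the `Task` fields, the `evaluate` function, and the three stage callables (`generate_video`, `video_to_trajectory`, `execute_in_sim`) are hypothetical names introduced here, and the per-level success rate is only one plausible way to aggregate execution outcomes.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Sequence

# All names below are hypothetical placeholders, not the Dream.exe API.
Video = Sequence[object]                 # generated frames, in any representation
Trajectory = Sequence[Sequence[float]]   # e.g. a list of end-effector waypoints


@dataclass(frozen=True)
class Task:
    scene_image: str   # path or identifier of the scene image
    description: str   # natural-language task description
    level: int         # physical complexity level (the abstract mentions three)


def evaluate(
    tasks: Iterable[Task],
    generate_video: Callable[[str, str], Video],
    video_to_trajectory: Callable[[Video], Trajectory],
    execute_in_sim: Callable[[Task, Trajectory], bool],
) -> Dict[int, float]:
    """Run each task through the three stages; return success rate per level."""
    outcomes: Dict[int, List[bool]] = defaultdict(list)
    for task in tasks:
        # Stage 1: synthesize a manipulation video from the scene and the task.
        video = generate_video(task.scene_image, task.description)
        # Stage 2: convert the depicted motion into a robot trajectory.
        trajectory = video_to_trajectory(video)
        # Stage 3: execute in a physics simulator and record success or failure.
        outcomes[task.level].append(bool(execute_in_sim(task, trajectory)))
    return {
        level: sum(results) / len(results)
        for level, results in sorted(outcomes.items())
    }
```

Passing the stages in as callables keeps the sketch independent of any particular video generator, motion-extraction method, or simulator, none of which the abstract specifies.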