Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?

June 4, 2026
Authors: Rui Zhao, Kaiming Yang, Jifeng Zhu, Siyang Chen, Ziqi Wang, Weijia Wu, Kevin Qinghong Lin, Heng Wang, Mike Zheng Shou
cs.AI

Abstract

Video generation models have made impressive strides in synthesizing visually compelling content, yet their outputs remain confined to the virtual domain. A natural question follows: how well do these models reflect the physical world when their generated videos leave the screen and enter reality? We propose robotic manipulation as a concrete, measurable window onto this question: if a model has truly internalized physical laws, the motion it depicts should translate into executable robot behavior. We introduce Dream.exe, an evaluation framework that operationalizes this criterion through a video-to-execution pipeline. Given a scene image and a task description, Dream.exe synthesizes a manipulation video, converts the generated motion into robot trajectories, and executes them in a physics simulator, yielding a grounding signal that purely visual metrics cannot offer. Using this pipeline, we evaluate 8 models spanning frontier closed-source generators, open-source generators, and robot-specific models. Our benchmark covers 101 manually curated manipulation tasks at three levels of physical complexity, measured across visual quality, trajectory fidelity, and execution success. Encouragingly, several models achieve measurable execution success, suggesting that generative priors learned from internet-scale data already encode meaningful physical knowledge. Yet visual quality proves a poor predictor of executability, exposing a dimension of model capability that standard visual evaluations do not capture. Dream.exe will be open-sourced at https://github.com/showlab/Dream.exe.
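
The video-to-execution pipeline described above can be pictured as three stages chained per task, with success tallied by complexity level. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `Task` type, the function `evaluate_execution`, and the three stage callables (`generate_video`, `video_to_trajectory`, `execute_in_sim`) are hypothetical names introduced here for illustration, and the abstract does not specify how any stage is actually implemented.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; field names are assumptions, not from the paper."""
    scene_image: Any   # initial scene observation, in whatever form the generator accepts
    description: str   # natural-language task description
    level: int         # physical complexity level (the benchmark uses three levels)


def evaluate_execution(
    tasks: Iterable[Task],
    generate_video: Callable[[Any, str], Any],
    video_to_trajectory: Callable[[Any], Any],
    execute_in_sim: Callable[[Task, Any], bool],
) -> Dict[int, float]:
    """Run each task through the three stages and return success rate per level."""
    successes: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for task in tasks:
        # Stage 1: synthesize a manipulation video from the scene and the instruction.
        video = generate_video(task.scene_image, task.description)
        # Stage 2: convert the depicted motion into a robot trajectory.
        trajectory = video_to_trajectory(video)
        # Stage 3: execute the trajectory in a physics simulator and record the outcome.
        succeeded = execute_in_sim(task, trajectory)
        totals[task.level] += 1
        successes[task.level] += int(succeeded)
    return {level: successes[level] / totals[level] for level in sorted(totals)}
```

Passing the stages in as callables keeps the sketch independent of any particular video generator, motion-extraction method, or simulator, so the same loop could in principle be reused across the different model families the benchmark compares.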