Dream.exe: Can Video Generation Models Dream of Executable Robot Manipulation?

Abstract

Video generation models have made impressive strides in synthesizing visually compelling content, yet their outputs remain confined to the virtual domain. A natural question follows: how well do these models reflect the physical world when their generated videos leave the screen and enter reality? We propose robotic manipulation as a concrete, measurable window onto this question: if a model has truly internalized physical laws, the motion it depicts should translate into executable robot behavior. We introduce Dream.exe, an evaluation framework that operationalizes this criterion through a video-to-execution pipeline. Given a scene image and a task description, Dream.exe synthesizes a manipulation video, converts the generated motion into robot trajectories, and executes them in a physics simulator, yielding a grounding signal that purely visual metrics cannot offer. Using this pipeline, we evaluate 8 models spanning frontier closed-source generators, open-source generators, and robot-specific models. Our benchmark covers 101 manually curated manipulation tasks at three levels of physical complexity, and we measure visual quality, trajectory fidelity, and execution success. Encouragingly, several models achieve measurable execution success, suggesting that generative priors learned from internet-scale data already encode meaningful physical knowledge. Yet visual quality proves a poor predictor of executability, exposing a dimension of model capability that standard visual evaluations do not capture. Dream.exe will be open-sourced at https://github.com/showlab/Dream.exe.
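
The video-to-execution pipeline described in the abstract can be pictured with a minimal Python sketch. Everything here is an assumption made for illustration: the names `Task`, `evaluate`, `generate_video`, `video_to_trajectory`, and `execute_in_sim`, as well as the stand-in `Video` and `Trajectory` types, are hypothetical and are not taken from the Dream.exe codebase. The sketch only shows how the three stages might chain together and how an execution success rate could be computed over a task set.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical stand-in types; the real framework's representations are not
# specified in the abstract.
Video = List[str]                    # placeholder for a sequence of frames
Trajectory = List[Sequence[float]]   # placeholder for robot waypoints


@dataclass
class Task:
    scene_image: str   # identifier or path of the scene image
    description: str   # natural-language task description
    level: int         # physical complexity level (1 to 3)


def evaluate(
    tasks: List[Task],
    generate_video: Callable[[str, str], Video],
    video_to_trajectory: Callable[[Video], Trajectory],
    execute_in_sim: Callable[[Task, Trajectory], bool],
) -> float:
    """Return the fraction of tasks whose generated motion executes successfully."""
    if not tasks:
        raise ValueError("need at least one task")
    successes = 0
    for task in tasks:
        # Stage 1: synthesize a manipulation video from the scene and the task text.
        video = generate_video(task.scene_image, task.description)
        # Stage 2: convert the depicted motion into a robot trajectory.
        trajectory = video_to_trajectory(video)
        # Stage 3: execute in a physics simulator and record success or failure.
        if execute_in_sim(task, trajectory):
            successes += 1
    return successes / len(tasks)
```

Passing the three stages as callables is a design choice of this sketch: it would let the same loop be reused for each evaluated video model by swapping only `generate_video`.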