Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?

Abstract

Video generation models have made impressive strides in synthesizing visually compelling content, yet their outputs remain confined to the virtual domain. A natural question follows: how well do these models reflect the physical world once their generated videos leave the screen and enter reality? We propose robotic manipulation as a concrete, measurable window onto this question: if a model has truly internalized physical laws, the motion it depicts should translate into executable robot behavior. We introduce Dream.exe, an evaluation framework that operationalizes this criterion through a video-to-execution pipeline. Given a scene image and a task description, Dream.exe synthesizes a manipulation video, converts the generated motion into robot trajectories, and executes them in a physics simulator, yielding a grounding signal that purely visual metrics cannot offer. Using this pipeline, we evaluate eight models spanning frontier closed-source generators, open-source generators, and robot-specific models. Our benchmark covers 101 manually curated manipulation tasks at three levels of physical complexity and scores each model on visual quality, trajectory fidelity, and execution success. Encouragingly, several models achieve measurable execution success, suggesting that generative priors learned from internet-scale data already encode meaningful physical knowledge. Yet visual quality proves a poor predictor of executability, exposing a dimension of model capability that standard visual evaluations do not capture. Dream.exe will be open-sourced at https://github.com/showlab/Dream.exe.
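
The three-stage video-to-execution pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the Dream.exe implementation: the names `Task`, `EpisodeResult`, `run_episode`, and `success_rate_by_level`, along with the signatures of the three stage callables, are hypothetical and chosen only to show how generation, trajectory extraction, and simulated execution might be chained and then aggregated per complexity level.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Sequence

# Hypothetical placeholder types; the real framework's data formats are not
# specified in the abstract.
Video = Sequence[Any]                 # a sequence of generated frames
Trajectory = Sequence[Sequence[float]]  # a sequence of robot waypoints


@dataclass
class Task:
    scene_image: str        # identifier or path of the initial scene image
    description: str        # natural-language task description
    complexity_level: int   # assumed encoding of the three levels: 1, 2, or 3


@dataclass
class EpisodeResult:
    task: Task
    num_frames: int
    num_waypoints: int
    success: bool


def run_episode(
    task: Task,
    generate_video: Callable[[str, str], Video],
    extract_trajectory: Callable[[Video], Trajectory],
    simulate: Callable[[Task, Trajectory], bool],
) -> EpisodeResult:
    """Chain the three stages: generate, convert to a trajectory, execute."""
    video = generate_video(task.scene_image, task.description)
    trajectory = extract_trajectory(video)
    success = simulate(task, trajectory)
    return EpisodeResult(task, len(video), len(trajectory), success)


def success_rate_by_level(results: List[EpisodeResult]) -> Dict[int, float]:
    """Fraction of successful episodes, grouped by task complexity level."""
    totals: Dict[int, int] = {}
    wins: Dict[int, int] = {}
    for r in results:
        level = r.task.complexity_level
        totals[level] = totals.get(level, 0) + 1
        wins[level] = wins.get(level, 0) + int(r.success)
    return {level: wins[level] / totals[level] for level in totals}
```

Passing the stages in as callables keeps the sketch independent of any particular video generator, motion-extraction method, or simulator, so each of the evaluated models could in principle be swapped in without changing the evaluation loop.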