RealWonder: Real-Time Physical Action-Conditioned Video Generation
March 5, 2026
Authors: Wei Liu, Ziyu Chen, Zizhang Li, Yue Wang, Hong-Xing Yu, Jiajun Wu
cs.AI
Abstract
Current video generation models cannot simulate the physical consequences of 3D actions, such as forces and robotic manipulation, because they lack a structured understanding of how actions affect 3D scenes. We present RealWonder, the first real-time system for action-conditioned video generation from a single image. Our key insight is to use physics simulation as an intermediate bridge: instead of directly encoding continuous actions, we translate them through physics simulation into visual representations (optical flow and RGB) that video models can process. RealWonder integrates three components: 3D reconstruction from a single image, physics simulation, and a distilled video generator requiring only 4 diffusion steps. Our system achieves 13.2 FPS at 480×832 resolution, enabling interactive exploration of forces, robot actions, and camera controls on rigid objects, deformable bodies, fluids, and granular materials. We envision that RealWonder will open new opportunities for applying video models in immersive experiences, AR/VR, and robot learning. Our code and model weights are publicly available on our project website: https://liuwei283.github.io/RealWonder/
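The three-stage pipeline described above (reconstruct, simulate, then condition a distilled video generator on the simulation's visual output) can be sketched schematically. This is a minimal illustrative toy, not the authors' implementation: every function name is hypothetical, the "physics" is a single point mass under a constant force, and "optical flow" is reduced to per-frame displacement vectors.

```python
# Hypothetical sketch of the RealWonder pipeline from the abstract.
# All names are illustrative; the real system reconstructs 3D scenes,
# runs a full physics simulator, and uses a 4-step distilled diffusion model.

def simulate_physics(scene, action, n_frames):
    """Toy stand-in for the physics stage: integrate a point mass
    under a constant force, recording its position each frame."""
    pos, vel = list(scene["pos"]), list(scene["vel"])
    frames = []
    for _ in range(n_frames):
        vel = [v + f for v, f in zip(vel, action["force"])]
        pos = [p + v for p, v in zip(pos, vel)]
        frames.append(tuple(pos))
    return frames

def to_visual_conditions(frames):
    """Translate simulated motion into optical-flow-like displacements:
    the visual representation the video model consumes, rather than
    raw action parameters."""
    return [tuple(b - a for a, b in zip(f0, f1))
            for f0, f1 in zip(frames, frames[1:])]

def generate_video(image, flow_conditions, steps=4):
    """Placeholder for the distilled video generator: one output frame
    per flow condition, each 'denoised' in only a few diffusion steps."""
    return [{"base": image, "flow": fl, "steps": steps}
            for fl in flow_conditions]

# Usage: a single input image plus a continuous action (a force).
scene = {"pos": (0.0, 0.0), "vel": (0.0, 0.0)}
frames = simulate_physics(scene, {"force": (1.0, 0.0)}, n_frames=5)
flows = to_visual_conditions(frames)
video = generate_video("single_input_image.png", flows)
```

The point of the structure is the bridge: actions never reach the video model directly; only their simulated visual consequences (`flows`) do.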