RealWonder: Real-Time Physical Action-Conditioned Video Generation

Abstract

Current video generation models cannot simulate the physical consequences of 3D actions such as forces and robotic manipulations, because they lack a structural understanding of how actions affect 3D scenes. We present RealWonder, the first real-time system for action-conditioned video generation from a single image. Our key insight is to use physics simulation as an intermediate bridge: instead of encoding continuous actions directly, we translate them through physics simulation into visual representations (optical flow and RGB) that video models can process. RealWonder integrates three components: 3D reconstruction from a single image, physics simulation, and a distilled video generator that requires only 4 diffusion steps. Our system achieves 13.2 FPS at 480x832 resolution, enabling interactive exploration of forces, robot actions, and camera controls on rigid objects, deformable bodies, fluids, and granular materials. We envision that RealWonder will open new opportunities to apply video models in immersive experiences, AR/VR, and robot learning. Our code and model weights are publicly available on our project website: https://liuwei283.github.io/RealWonder/
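
To make the "physics simulation as an intermediate bridge" idea concrete, the following is a minimal toy sketch, not the RealWonder implementation. It assumes a single 2D point mass living directly in image coordinates, a constant force as the action, and semi-implicit Euler integration; the names `Particle`, `step`, and `action_to_flow` are hypothetical and chosen only for illustration. The displacement of each particle over one step plays the role of a sparse optical-flow vector that a video model could, in principle, be conditioned on. The last two lines work out the per-frame time budget implied by the reported 13.2 FPS.

```python
from dataclasses import dataclass


@dataclass
class Particle:
    # Toy assumption: positions live directly in pixel coordinates,
    # so no 3D reconstruction or camera projection is modeled here.
    x: float
    y: float
    vx: float = 0.0
    vy: float = 0.0
    mass: float = 1.0


def step(p: Particle, fx: float, fy: float, dt: float) -> Particle:
    """Semi-implicit Euler: update velocity from the force, then position."""
    vx = p.vx + fx / p.mass * dt
    vy = p.vy + fy / p.mass * dt
    return Particle(p.x + vx * dt, p.y + vy * dt, vx, vy, p.mass)


def action_to_flow(particles, force, dt):
    """Map an action (a force) to per-particle flow vectors (dx, dy)."""
    fx, fy = force
    flows = []
    for p in particles:
        q = step(p, fx, fy, dt)
        flows.append((q.x - p.x, q.y - p.y))
    return flows


# A force of 4 units along +x on a particle of mass 2, over a 0.5 s step.
flows = action_to_flow([Particle(100.0, 200.0, mass=2.0)], force=(4.0, 0.0), dt=0.5)

# Per-frame time budget implied by the reported throughput of 13.2 FPS.
FPS = 13.2
frame_budget_ms = 1000.0 / FPS  # roughly 75.8 ms per frame
```

In this sketch the action never reaches the generator as a raw force vector; only its simulated visual effect does, which is the design choice the abstract describes. A real system would replace the point mass with a full simulator over the reconstructed 3D scene and render dense flow and RGB.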
