From Virtual Games to Real-World Play

Abstract

We introduce RealPlay, a neural network-based real-world game engine that enables interactive video generation from user control signals. Unlike prior work focused on game-style visuals, RealPlay aims to produce photorealistic, temporally consistent video sequences that resemble real-world footage. It operates in an interactive loop: users observe a generated scene, issue a control command, and receive a short video chunk in response. To enable such realistic and responsive generation, we address three key challenges: iterative chunk-wise prediction for low-latency feedback, temporal consistency across iterations, and accurate control response. RealPlay is trained on a combination of labeled game data and unlabeled real-world videos, without requiring real-world action annotations. Notably, we observe two forms of generalization: (1) control transfer, in which RealPlay effectively maps control signals from virtual to real-world scenarios; and (2) entity transfer, in which RealPlay, although its training labels originate solely from a car racing game, generalizes to control diverse real-world entities beyond vehicles, including bicycles and pedestrians. The project page is available at https://wenqsun.github.io/RealPlay/
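The interactive loop described above (observe a frame, issue a command, receive a short chunk, repeat) can be illustrated with a minimal sketch. Everything in it is hypothetical: `ToyChunkModel`, `predict_chunk`, `play`, the `STEER` command set, and the `chunk_len` and `context_len` parameters are illustrative names, not RealPlay's API. Frames are plain lists of floats standing in for images, and feeding trailing frames back as context is one assumed way to link consecutive chunks.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = List[float]  # hypothetical stand-in for an image tensor

# Hypothetical command set; the abstract does not list RealPlay's controls.
STEER = {"forward": 0.0, "left": -1.0, "right": 1.0}


@dataclass
class ToyChunkModel:
    """Placeholder for a chunk-wise video generator (not RealPlay's model)."""

    chunk_len: int = 4    # frames returned per user command (assumed)
    context_len: int = 2  # trailing frames fed back in as context (assumed)

    def predict_chunk(self, context: Sequence[Frame], control: str) -> List[Frame]:
        # A real model would generate a chunk conditioned on the context
        # frames and the control signal. This toy version keeps shifting the
        # last context frame by the steering offset, so the output depends
        # on both the previous chunk and the command.
        step = STEER[control]
        frame = list(context[-1])
        chunk: List[Frame] = []
        for _ in range(self.chunk_len):
            frame = [x + step for x in frame]
            chunk.append(frame)
        return chunk


def play(model: ToyChunkModel, first_frame: Frame,
         user: Callable[[Frame], str], steps: int) -> List[Frame]:
    """Run `steps` rounds of observe -> command -> short video chunk."""
    video: List[Frame] = [first_frame]
    for _ in range(steps):
        control = user(video[-1])               # user observes the latest frame
        context = video[-model.context_len:]    # recent frames carried forward
        video.extend(model.predict_chunk(context, control))
    return video


if __name__ == "__main__":
    frames = play(ToyChunkModel(), [0.0, 0.0], lambda f: "right", steps=3)
    print(len(frames), frames[-1])  # 13 [12.0, 12.0]
```

Short chunks keep latency low because the user waits for only `chunk_len` frames per command, while conditioning each chunk on the tail of the previous one is a simple way to keep iterations temporally consistent. This is not necessarily the mechanism RealPlay uses.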
