WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

Abstract

This paper presents WorldPlay, a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods. WorldPlay builds on three key innovations. 1) A Dual Action Representation enables robust action control in response to the user's keyboard and mouse inputs. 2) To enforce long-term consistency, a Reconstituted Context Memory dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible, which effectively alleviates memory attenuation. 3) Context Forcing is a novel distillation method designed for memory-aware models. Aligning the memory context between the teacher and the student preserves the student's capacity to use long-range information, enabling real-time speeds while preventing error drift. Taken together, WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, compares favorably with existing techniques, and shows strong generalization across diverse scenes. The project page and online demo are available at https://3d-models.hunyuan.tencent.com/world/ and https://3d.hunyuan.tencent.com/sceneTo3D.
