

WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation

March 17, 2026
作者: Jisu Nam, Yicong Hong, Chun-Hao Paul Huang, Feng Liu, JoungBin Lee, Jiyoung Kim, Siyoon Jin, Yunsung Lee, Jaeyoon Jung, Suhwan Choi, Seungryong Kim, Yang Zhou
cs.AI

Abstract

Recent advances in video diffusion transformers have enabled interactive gaming world models that allow users to explore generated environments over extended horizons. However, existing approaches struggle with precise action control and long-horizon 3D consistency. Most prior works treat user actions as abstract conditioning signals, overlooking the fundamental geometric coupling between actions and the 3D world, whereby actions induce relative camera motions that accumulate into a global camera pose within a 3D world. In this paper, we establish camera pose as a unifying geometric representation to jointly ground immediate action control and long-term 3D consistency. First, we define a physics-based continuous action space and represent user inputs in the Lie algebra to derive precise 6-DoF camera poses, which are injected into the generative model via a camera embedder to ensure accurate action alignment. Second, we use global camera poses as spatial indices to retrieve relevant past observations, enabling geometrically consistent revisiting of locations during long-horizon navigation. To support this research, we introduce a large-scale dataset comprising 3,000 minutes of authentic human gameplay annotated with camera trajectories and textual descriptions. Extensive experiments show that our approach substantially outperforms state-of-the-art interactive gaming world models in action controllability, long-horizon visual quality, and 3D spatial consistency.
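The abstract's two core mechanisms — deriving a 6-DoF pose from a Lie-algebra action increment, and retrieving past observations indexed by global pose — can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the se(3) parameterization (translation then rotation components), the frame memory as a list of (pose, frame) pairs, and a plain translation-distance retrieval criterion are all assumptions made for illustration.

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix of a 3-vector (so(3) hat operator)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def se3_exp(xi):
    """Exponential map from se(3) to SE(3).

    xi = (v, w): translational part v and rotational part w of a
    per-step action twist (assumed layout, not from the paper).
    Returns a 4x4 homogeneous relative camera pose.
    """
    v, w = xi[:3], xi[3:]
    theta = np.linalg.norm(w)
    W = hat(w)
    if theta < 1e-8:
        # First-order approximation near the identity
        R = np.eye(3) + W
        V = np.eye(3) + 0.5 * W
    else:
        # Rodrigues' formula and the left Jacobian of SO(3)
        R = (np.eye(3) + np.sin(theta) / theta * W
             + (1 - np.cos(theta)) / theta**2 * W @ W)
        V = (np.eye(3) + (1 - np.cos(theta)) / theta**2 * W
             + (theta - np.sin(theta)) / theta**3 * W @ W)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ v
    return T

def accumulate(global_pose, xi):
    """Compose a relative action-induced motion into the global pose."""
    return global_pose @ se3_exp(xi)

def retrieve_nearest(memory, query_pose, k=3):
    """Pose-indexed memory lookup: return the k frames whose stored
    camera positions are closest to the query pose's position.
    memory is a list of (pose_4x4, frame) pairs (assumed structure)."""
    dists = [np.linalg.norm(T[:3, 3] - query_pose[:3, 3])
             for T, _ in memory]
    order = np.argsort(dists)[:k]
    return [memory[i][1] for i in order]
```

In this sketch, forward motion of one unit is `xi = [1,0,0, 0,0,0]` and a 90° yaw is `xi = [0,0,0, 0,0,np.pi/2]`; accumulating such twists yields the global pose that would index the frame memory.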