WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation
March 17, 2026
Authors: Jisu Nam, Yicong Hong, Chun-Hao Paul Huang, Feng Liu, JoungBin Lee, Jiyoung Kim, Siyoon Jin, Yunsung Lee, Jaeyoon Jung, Suhwan Choi, Seungryong Kim, Yang Zhou
cs.AI
Abstract
Recent advances in video diffusion transformers have enabled interactive gaming world models that allow users to explore generated environments over extended horizons. However, existing approaches struggle with precise action control and long-horizon 3D consistency. Most prior works treat user actions as abstract conditioning signals, overlooking the fundamental geometric coupling between actions and the 3D world, whereby actions induce relative camera motions that accumulate into a global camera pose within a 3D world. In this paper, we establish camera pose as a unifying geometric representation to jointly ground immediate action control and long-term 3D consistency. First, we define a physics-based continuous action space and represent user inputs in the Lie algebra to derive precise 6-DoF camera poses, which are injected into the generative model via a camera embedder to ensure accurate action alignment. Second, we use global camera poses as spatial indices to retrieve relevant past observations, enabling geometrically consistent revisiting of locations during long-horizon navigation. To support this research, we introduce a large-scale dataset comprising 3,000 minutes of authentic human gameplay annotated with camera trajectories and textual descriptions. Extensive experiments show that our approach substantially outperforms state-of-the-art interactive gaming world models in action controllability, long-horizon visual quality, and 3D spatial consistency.
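The abstract's first mechanism — representing a continuous user action in the Lie algebra se(3) and exponentiating it into a relative 6-DoF camera pose that accumulates into a global pose — can be illustrated with a small sketch. This is a minimal NumPy implementation of the standard SE(3) exponential map (Rodrigues' formula), not the paper's actual code; the action parameterization `(v, w, dt)` and step values are illustrative assumptions.

```python
import numpy as np

def hat(w):
    # Skew-symmetric matrix [w]x of a 3-vector, so that [w]x @ p = w x p.
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def se3_exp(v, w, dt):
    """Map a continuous action (linear velocity v, angular velocity w,
    duration dt) from the Lie algebra se(3) to a relative SE(3) pose."""
    xi_v = np.asarray(v, dtype=float) * dt   # translational component
    xi_w = np.asarray(w, dtype=float) * dt   # rotational component
    theta = np.linalg.norm(xi_w)
    W = hat(xi_w)
    if theta < 1e-8:
        # Small-angle limit: first-order rotation, identity left Jacobian.
        R, V = np.eye(3) + W, np.eye(3)
    else:
        W2 = W @ W
        # Rodrigues' rotation formula and the SE(3) left Jacobian V.
        R = (np.eye(3) + np.sin(theta) / theta * W
             + (1 - np.cos(theta)) / theta**2 * W2)
        V = (np.eye(3) + (1 - np.cos(theta)) / theta**2 * W
             + (theta - np.sin(theta)) / theta**3 * W2)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, V @ xi_v
    return T

# Relative poses from per-step actions compose into a global camera pose.
global_pose = np.eye(4)
for v, w in [([1.0, 0.0, 0.0], [0.0, 0.1, 0.0])] * 10:  # "move forward, turn"
    global_pose = global_pose @ se3_exp(v, w, dt=0.1)
```

In the paper's pipeline, a pose like `global_pose` would then be fed to the camera embedder as conditioning; here it simply demonstrates how per-action relative motions accumulate into a single global 6-DoF pose.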
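The second mechanism — using global camera poses as spatial indices to retrieve relevant past observations when a location is revisited — can be sketched as a pose-keyed memory with a nearest-pose lookup. The `PoseMemory` class and the translation-plus-geodesic-rotation distance below are hypothetical illustrations under the assumption of a brute-force linear scan; the actual model presumably retrieves latent features rather than raw frames.

```python
import numpy as np

def pose_distance(T_a, T_b, rot_weight=1.0):
    """Distance between two 4x4 SE(3) poses: Euclidean translation
    offset plus a weighted geodesic rotation angle (radians)."""
    t = np.linalg.norm(T_a[:3, 3] - T_b[:3, 3])
    R_rel = T_a[:3, :3].T @ T_b[:3, :3]
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return t + rot_weight * np.arccos(cos_angle)

class PoseMemory:
    """Hypothetical memory bank keyed by global camera pose."""
    def __init__(self):
        self.entries = []  # list of (pose, observation) pairs

    def add(self, pose, obs):
        self.entries.append((pose, obs))

    def retrieve(self, query_pose, k=4):
        # Return the k stored observations nearest to the query pose.
        ranked = sorted(self.entries,
                        key=lambda e: pose_distance(e[0], query_pose))
        return [obs for _, obs in ranked[:k]]

# Usage: revisiting a spot near the origin retrieves the matching frame.
memory = PoseMemory()
T_origin = np.eye(4)
T_far = np.eye(4)
T_far[:3, 3] = [5.0, 0.0, 0.0]
memory.add(T_origin, "frame_at_origin")
memory.add(T_far, "frame_far_away")
query = np.eye(4)
query[:3, 3] = [0.1, 0.0, 0.0]
nearest = memory.retrieve(query, k=1)
```

Conditioning generation on `nearest` is what makes a revisited location look geometrically consistent with what was generated before, rather than being hallucinated anew.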