

Stereo World Model: Camera-Guided Stereo Video Generation

March 18, 2026
作者: Yang-Tian Sun, Zehuan Huang, Yifan Niu, Lin Ma, Yan-Pei Cao, Yuewen Ma, Xiaojuan Qi
cs.AI

Abstract

We present StereoWorld, a camera-conditioned stereo world model that jointly learns appearance and binocular geometry for end-to-end stereo video generation. Unlike monocular RGB or RGBD approaches, StereoWorld operates exclusively within the RGB modality while grounding geometry directly in disparity. To achieve efficient and consistent stereo generation, our approach introduces two key designs: (1) a unified camera-frame RoPE that augments latent tokens with camera-aware rotary positional encoding, enabling relative, view- and time-consistent conditioning while preserving pretrained video priors via a stable attention initialization; and (2) a stereo-aware attention decomposition that factors full 4D attention into 3D intra-view attention plus horizontal row attention, leveraging the epipolar prior to capture disparity-aligned correspondences at substantially lower compute. Across benchmarks, StereoWorld improves stereo consistency, disparity accuracy, and camera-motion fidelity over strong monocular-then-convert pipelines, achieving more than 3x faster generation with an additional 5% gain in viewpoint consistency. Beyond benchmarks, StereoWorld enables end-to-end binocular VR rendering without depth estimation or inpainting, enhances embodied policy learning through metric-scale depth grounding, and is compatible with long-video distillation for extended interactive stereo synthesis.
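The stereo-aware attention decomposition in design (2) can be illustrated with a minimal numpy sketch. This is not the paper's implementation; the function names and the single-head, unprojected attention are illustrative assumptions. The key idea it demonstrates: instead of full attention over all (view, time, height, width) tokens, each view first attends over its own 3D tokens, then cross-view mixing happens only along matching horizontal rows, which is where correspondences lie for rectified stereo (the epipolar prior).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # q, k, v: (..., n, d); plain scaled dot-product attention
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def stereo_factored_attention(x):
    """x: (V, T, H, W, D) latent tokens for V=2 rectified views.

    Factors full 4D attention into:
      (a) intra-view attention: each view attends over its own
          T*H*W tokens, and
      (b) horizontal row attention: for each (t, h), the V*W tokens
          on that row across both views attend to each other,
    so cross-view mixing is restricted to matching rows.
    """
    V, T, H, W, D = x.shape
    # (a) 3D intra-view attention
    intra = x.reshape(V, T * H * W, D)
    intra = attention(intra, intra, intra).reshape(V, T, H, W, D)
    # (b) row attention across views: sequence length V*W per (t, h)
    rows = intra.transpose(1, 2, 0, 3, 4).reshape(T, H, V * W, D)
    rows = attention(rows, rows, rows)
    return rows.reshape(T, H, V, W, D).transpose(2, 0, 1, 3, 4)
```

The cost motivation is visible in the sequence lengths: full 4D attention scales as O((V·T·H·W)^2), while the factored form costs O(V·(T·H·W)^2) for the intra-view pass plus O(T·H·(V·W)^2) for the row pass, which is far cheaper at video resolutions.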