WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories
March 2, 2026
Authors: Yisu Zhang, Chenjie Cao, Tengfei Wang, Xuhui Zuo, Junta Wu, Jianke Zhu, Chunchao Guo
cs.AI
Abstract
Foundational Video Diffusion Models (VDMs) have advanced significantly in recent years. Yet despite the remarkable visual quality of generated videos, reconstructing consistent 3D scenes from them remains challenging, because camera controllability is limited and content generated along different camera trajectories is inconsistent. In this paper, we propose WorldStereo, a novel framework that bridges camera-guided video generation and 3D reconstruction via two dedicated geometric memory modules. Specifically, the global-geometric memory enables precise camera control while injecting coarse structural priors through incrementally updated point clouds. The spatial-stereo memory, in turn, uses 3D correspondence to constrain the model's attention receptive fields so that they focus on fine-grained details from the memory bank. Together, these components enable WorldStereo to generate multi-view-consistent videos under precise camera control, facilitating high-quality 3D reconstruction. Furthermore, WorldStereo is built on a flexible control branch and is highly efficient, benefiting from a distribution-matching-distilled VDM backbone without requiring joint training. Extensive experiments on both camera-guided video generation and 3D reconstruction benchmarks demonstrate the effectiveness of our approach. Notably, we show that WorldStereo acts as a powerful world model, tackling diverse scene generation tasks with high-fidelity 3D results, whether starting from perspective or panoramic images. Models will be released.
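To make the idea of an incrementally updated point cloud concrete, the following is a minimal sketch, not the authors' implementation. It assumes a plain pinhole camera model: each new depth map is unprojected into world-space points that are appended to a memory, and the accumulated points are then rendered into a target view as a coarse depth prior. The names Camera, PointCloudMemory, add_view, and render_depth are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Camera:
    """Hypothetical pinhole camera: intrinsics plus a camera-to-world pose."""
    fx: float
    fy: float
    cx: float
    cy: float
    R: list  # 3x3 camera-to-world rotation, row-major nested lists
    t: tuple  # camera centre in world coordinates


def unproject(cam, u, v, depth):
    """Lift pixel (u, v) with the given depth to a world-space point."""
    pc = ((u - cam.cx) / cam.fx * depth, (v - cam.cy) / cam.fy * depth, depth)
    return tuple(sum(cam.R[i][j] * pc[j] for j in range(3)) + cam.t[i]
                 for i in range(3))


def project(cam, p):
    """Project a world point into the camera; None if it lies behind it."""
    d = [p[i] - cam.t[i] for i in range(3)]
    # Apply the inverse rotation (transpose of R) to get camera coordinates.
    pc = [sum(cam.R[j][i] * d[j] for j in range(3)) for i in range(3)]
    if pc[2] <= 0:
        return None
    return (cam.fx * pc[0] / pc[2] + cam.cx,
            cam.fy * pc[1] / pc[2] + cam.cy,
            pc[2])


@dataclass
class PointCloudMemory:
    """Assumed stand-in for a global geometric memory: a growing point list."""
    points: list = field(default_factory=list)

    def add_view(self, cam, depth_map):
        """Incremental update: append points unprojected from a new view."""
        for v, row in enumerate(depth_map):
            for u, d in enumerate(row):
                if d > 0:  # treat non-positive depth as invalid
                    self.points.append(unproject(cam, u, v, d))

    def render_depth(self, cam, height, width):
        """Splat stored points into a target view, keeping the nearest depth.

        Pixels that no point reaches stay at 0.0.
        """
        out = [[0.0] * width for _ in range(height)]
        for p in self.points:
            hit = project(cam, p)
            if hit is None:
                continue
            u, v, z = hit
            ui, vi = round(u), round(v)
            if 0 <= ui < width and 0 <= vi < height:
                if out[vi][ui] == 0.0 or z < out[vi][ui]:
                    out[vi][ui] = z
        return out
```

In this sketch, a depth map rendered along a new camera trajectory would serve as the coarse structural prior. How WorldStereo actually encodes and injects that prior into the video model is not specified in the abstract.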