StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation
December 10, 2025
Authors: Ke Xing, Longfei Li, Yuyang Yin, Hanwen Liang, Guixun Luo, Chen Fang, Jue Wang, Konstantinos N. Plataniotis, Xiaojie Jin, Yao Zhao, Yunchao Wei
cs.AI
Abstract
The growing adoption of XR devices has fueled strong demand for high-quality stereo video, yet its production remains costly and artifact-prone. To address this challenge, we present StereoWorld, an end-to-end framework that repurposes a pretrained video generator for high-fidelity monocular-to-stereo video generation. Our framework jointly conditions the model on the monocular video input while explicitly supervising the generation with a geometry-aware regularization to ensure 3D structural fidelity. A spatio-temporal tiling scheme is further integrated to enable efficient, high-resolution synthesis. To enable large-scale training and evaluation, we curate a high-definition stereo video dataset containing over 11M frames aligned to natural human interpupillary distance (IPD). Extensive experiments demonstrate that StereoWorld substantially outperforms prior methods, generating stereo videos with superior visual fidelity and geometric consistency. The project webpage is available at https://ke-xing.github.io/StereoWorld/.
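To make the idea of a geometry-aware regularization concrete, the sketch below shows one plausible instantiation: an L1 left-right photometric consistency loss, in which the generated right view is backward-warped into the left camera using per-pixel horizontal disparity and compared against the input left frame. This is a minimal illustration under assumed conventions (rectified views, disparity in pixels); the names `warp_right_to_left` and `geometry_consistency_loss` and the exact loss form are assumptions for illustration, not the paper's specification.

```python
import torch
import torch.nn.functional as F

def warp_right_to_left(right, disp):
    """Backward-warp the right view into the left camera.

    right: (B, C, H, W) generated right-eye frames.
    disp:  (B, 1, H, W) horizontal disparity in pixels (>= 0 for
           rectified pairs: a left pixel at x maps to x - disp on the right).
    """
    B, _, H, W = right.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=right.device, dtype=right.dtype),
        torch.arange(W, device=right.device, dtype=right.dtype),
        indexing="ij",
    )
    x_src = xs.unsqueeze(0) - disp.squeeze(1)           # (B, H, W) sample positions
    grid = torch.stack(
        [2.0 * x_src / (W - 1) - 1.0,                   # x, normalized to [-1, 1]
         (2.0 * ys / (H - 1) - 1.0).expand_as(x_src)],  # y, normalized to [-1, 1]
        dim=-1,
    )
    return F.grid_sample(right, grid, align_corners=True, padding_mode="border")

def geometry_consistency_loss(left, right_gen, disp):
    # Illustrative L1 photometric consistency: the generated right view,
    # warped back into the left camera, should match the input left frame.
    # NOTE: an assumed form of the loss, not the paper's exact regularizer.
    return (left - warp_right_to_left(right_gen, disp)).abs().mean()
```

In practice the disparity could come from a monocular depth estimator scaled by an IPD-aligned baseline; the key property is that the loss is differentiable with respect to the generated right view, so it can supervise the generator directly.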
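Likewise, a spatio-temporal tiling scheme can be sketched as follows: the video is split into overlapping tiles along time and space, each tile is processed independently, and the results are blended with linear ramp weights in the overlap regions to hide seams. The tile sizes, overlap widths, blending window, and the generic `fn` standing in for the video generator are all illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def _starts(length, tile, stride):
    # Tile origins along one axis; the last tile is pinned to the end
    # so the full extent is covered.
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] != length - tile:
        starts.append(length - tile)
    return starts

def _ramp(tile, overlap):
    # 1-D blending window: linear ramps over the overlap region, flat
    # in the interior, strictly positive everywhere.
    w = np.ones(tile, dtype=np.float32)
    if overlap > 0:
        r = np.linspace(0.0, 1.0, overlap + 2, dtype=np.float32)[1:-1]
        w[:overlap] = r
        w[-overlap:] = r[::-1]
    return w

def tiled_apply(video, fn, tile=(16, 256, 256), overlap=(4, 32, 32)):
    """Apply `fn` to overlapping spatio-temporal tiles and blend the results.

    video: (T, H, W, C) array; fn maps a tile to an equally shaped tile.
    tile/overlap are (time, height, width) sizes -- illustrative defaults.
    """
    T, H, W, C = video.shape
    tile = tuple(min(t, s) for t, s in zip(tile, (T, H, W)))
    overlap = tuple(min(o, t - 1) for o, t in zip(overlap, tile))
    strides = [t - o for t, o in zip(tile, overlap)]
    # Separable 3-D weight volume from the per-axis ramps.
    wt, wy, wx = (_ramp(t, o) for t, o in zip(tile, overlap))
    w = (wt[:, None, None] * wy[None, :, None] * wx[None, None, :])[..., None]
    out = np.zeros((T, H, W, C), dtype=np.float32)
    norm = np.zeros((T, H, W, 1), dtype=np.float32)
    for t0 in _starts(T, tile[0], strides[0]):
        for y0 in _starts(H, tile[1], strides[1]):
            for x0 in _starts(W, tile[2], strides[2]):
                sl = (slice(t0, t0 + tile[0]),
                      slice(y0, y0 + tile[1]),
                      slice(x0, x0 + tile[2]))
                out[sl] += w * fn(video[sl])
                norm[sl] += w
    return out / norm  # every pixel has positive accumulated weight
```

Because the ramp weights are strictly positive and the last tile on each axis is pinned to the boundary, every pixel is covered and correctly normalized, which keeps tile seams from appearing in the blended output.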