Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation
December 18, 2025
Authors: Min-Jung Kim, Jeongho Kim, Hoiyeong Jin, Junha Hyung, Jaegul Choo
cs.AI
Abstract
Recent progress in video diffusion models has spurred growing interest in camera-controlled novel-view video generation for dynamic scenes, aiming to give creators cinematic camera control in post-production. A key challenge in camera-controlled video generation is ensuring fidelity to the specified camera pose while maintaining view consistency and reasoning about occluded geometry from limited observations. Existing methods either train trajectory-conditioned video generation models on trajectory-video paired datasets, or estimate depth from the input video to reproject it along a target trajectory and then generate the unprojected regions. Nevertheless, these methods struggle to produce camera-pose-faithful, high-quality videos for two main reasons: (1) reprojection-based approaches are highly susceptible to errors from inaccurate depth estimation; and (2) the limited diversity of camera trajectories in existing datasets restricts the performance of learned models. To address these limitations, we present InfCam, a depth-free, camera-controlled video-to-video generation framework with high pose fidelity. The framework integrates two key components: (1) infinite-homography warping, which encodes 3D camera rotations directly in the 2D latent space of a video diffusion model; conditioned on this noise-free rotational information, the model predicts the residual parallax term through end-to-end training to achieve high camera-pose fidelity; and (2) a data augmentation pipeline that transforms existing synthetic multi-view datasets into sequences with diverse trajectories and focal lengths. Experimental results demonstrate that InfCam outperforms baseline methods in camera-pose accuracy and visual fidelity, and generalizes well from synthetic to real-world data. Project page: https://emjay73.github.io/InfCam/
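To make the core idea concrete, below is a minimal sketch of classical infinite-homography warping, assuming known camera intrinsics and a relative rotation between views. The function names (`infinite_homography`, `warp_by_rotation`) and the OpenCV-based warp are illustrative assumptions, not the paper's implementation. Under a pure rotation, pixels map as x' ~ K_tgt R K_src⁻¹ x, with no dependence on depth; the translation-induced parallax that this warp deliberately ignores is exactly the residual the abstract says InfCam's diffusion model is trained to predict.

```python
# Minimal sketch of infinite-homography warping (NOT InfCam's implementation).
# Assumes known source/target intrinsics K_src, K_tgt and a relative rotation
# R_rel from the source to the target camera frame.
import numpy as np
import cv2  # opencv-python


def infinite_homography(K_src, K_tgt, R_rel):
    """H_inf = K_tgt @ R_rel @ inv(K_src): the homography induced by the
    plane at infinity, i.e. the depth-independent, rotation-only warp."""
    return K_tgt @ R_rel @ np.linalg.inv(K_src)


def warp_by_rotation(image, K_src, K_tgt, R_rel):
    """Warp `image` to the target view under a pure camera rotation.
    Translation-induced parallax is ignored by construction; in InfCam
    that residual is what the video diffusion model learns to predict."""
    H = infinite_homography(K_src, K_tgt, R_rel)
    h, w = image.shape[:2]
    return cv2.warpPerspective(image, H, (w, h))


if __name__ == "__main__":
    # Hypothetical intrinsics for a 640x480 frame (fx = fy = 500 px).
    K = np.array([[500.0,   0.0, 320.0],
                  [  0.0, 500.0, 240.0],
                  [  0.0,   0.0,   1.0]])
    # A 5-degree yaw: rotation about the camera's vertical (y) axis.
    theta = np.deg2rad(5.0)
    R = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
                  [           0.0, 1.0,           0.0],
                  [-np.sin(theta), 0.0, np.cos(theta)]])
    frame = np.zeros((480, 640, 3), dtype=np.uint8)
    warped = warp_by_rotation(frame, K, K, R)
    print(warped.shape)  # (480, 640, 3)
```

Because H_inf contains no depth term, this warp is exact for arbitrary scene geometry when the camera only rotates, which is what lets the paper condition the generator on noise-free rotational information instead of error-prone estimated depth.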