Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation

December 18, 2025
Authors: Min-Jung Kim, Jeongho Kim, Hoiyeong Jin, Junha Hyung, Jaegul Choo
cs.AI

Abstract

Recent progress in video diffusion models has spurred growing interest in camera-controlled novel-view video generation for dynamic scenes, aiming to provide creators with cinematic camera control in post-production. A key challenge in camera-controlled video generation is ensuring fidelity to the specified camera pose while maintaining view consistency and reasoning about occluded geometry from limited observations. To address this, existing methods either train a trajectory-conditioned video generation model on trajectory-video paired datasets, or estimate depth from the input video to reproject it along a target trajectory and generate the unprojected regions. Nevertheless, existing methods struggle to generate camera-pose-faithful, high-quality videos for two main reasons: (1) reprojection-based approaches are highly susceptible to errors caused by inaccurate depth estimation; and (2) the limited diversity of camera trajectories in existing datasets restricts what the learned models can express. To address these limitations, we present InfCam, a depth-free, camera-controlled video-to-video generation framework with high pose fidelity. The framework integrates two key components: (1) infinite-homography warping, which encodes 3D camera rotations directly in the 2D latent space of a video diffusion model; by conditioning on this noise-free rotational information, the model predicts the residual parallax term through end-to-end training, achieving high camera-pose fidelity; and (2) a data augmentation pipeline that transforms existing synthetic multi-view datasets into sequences with diverse trajectories and focal lengths. Experimental results demonstrate that InfCam outperforms baseline methods in camera-pose accuracy and visual fidelity, generalizing well from synthetic to real-world data. Project page: https://emjay73.github.io/InfCam/
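For context, the infinite homography is the classical depth-independent warp from multi-view geometry: under a pure rotation R between views with intrinsics K_src and K_tgt, a source pixel maps to the target view via H∞ = K_tgt · R · K_src⁻¹, with scene depth cancelling out entirely (translation-induced parallax is what remains). The sketch below illustrates this warp in pixel space with placeholder intrinsics and rotation; it is background for the abstract's terminology, not the paper's implementation, which applies the warp inside the 2D latent space of a video diffusion model.

```python
# Minimal sketch of infinite-homography warping (classical multi-view geometry).
# The intrinsics and rotation below are illustrative placeholders, not values
# from the paper, and InfCam applies this warp in a diffusion model's latent
# space rather than in pixel space as done here.
import numpy as np
import cv2

def infinite_homography(K_src: np.ndarray, K_tgt: np.ndarray, R: np.ndarray) -> np.ndarray:
    """H_inf = K_tgt @ R @ K_src^-1: the warp induced by the plane at infinity.

    It captures the full 3D camera rotation (and any focal-length change)
    without needing a depth estimate; only translation-induced parallax is
    left unexplained by this warp.
    """
    return K_tgt @ R @ np.linalg.inv(K_src)

# Illustrative example: warp a frame by a 5-degree yaw (rotation about the y-axis).
H, W = 480, 640
f = 500.0  # assumed focal length in pixels
K = np.array([[f, 0.0, W / 2],
              [0.0, f, H / 2],
              [0.0, 0.0, 1.0]])
theta = np.deg2rad(5.0)
R_yaw = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                  [0.0, 1.0, 0.0],
                  [-np.sin(theta), 0.0, np.cos(theta)]])

frame = np.zeros((H, W, 3), dtype=np.uint8)  # stand-in for a video frame
H_inf = infinite_homography(K, K, R_yaw)
warped = cv2.warpPerspective(frame, H_inf, (W, H))
```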