

CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation

April 10, 2026
Authors: Haoyu Zhao, Zihao Zhang, Jiaxi Gu, Haoran Chen, Qingping Zheng, Pin Tang, Yeyin Jin, Yuang Zhang, Junqi Cheng, Zenghui Lu, Peng Shu, Zuxuan Wu, Yu-Gang Jiang
cs.AI

Abstract

Camera-controllable video generation aims to synthesize videos with flexible and physically plausible camera movements. However, existing methods either provide imprecise camera control from text prompts or rely on labor-intensive manual specification of camera trajectory parameters, limiting their use in automated scenarios. To address these issues, we propose a novel Vision-Language-Camera model, termed CT-1 (Camera Transformer 1), a specialized model designed to transfer spatial reasoning knowledge to video generation by accurately estimating camera trajectories. Built upon vision-language modules and a Diffusion Transformer model, CT-1 employs a Wavelet-based Regularization Loss in the frequency domain to effectively learn complex camera trajectory distributions. These trajectories are integrated into a video diffusion model to enable spatially aware camera control that aligns with user intentions. To facilitate the training of CT-1, we design a dedicated data curation pipeline and construct CT-200K, a large-scale dataset containing over 47M frames. Experimental results demonstrate that our framework successfully bridges the gap between spatial reasoning and video synthesis, yielding faithful, high-quality camera-controllable videos and improving camera control accuracy by 25.7% over prior methods.
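
The abstract does not give the exact formulation of the Wavelet-based Regularization Loss. The sketch below illustrates one plausible reading: decompose predicted and ground-truth camera trajectories with a single-level Haar wavelet transform along the time axis, then penalize the coefficient mismatch so that low-frequency (coarse motion) and high-frequency (fine motion) errors are supervised separately. The names `haar_dwt_1d`, `wavelet_regularization_loss`, and `detail_weight`, as well as the choice of an L1 penalty, are illustrative assumptions rather than the paper's implementation.

```python
import torch

def haar_dwt_1d(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Single-level orthonormal Haar wavelet transform along the last (time) axis.

    x: (..., T) trajectory signal; T is assumed even for simplicity.
    Returns (approximation, detail) coefficients, each of shape (..., T // 2).
    """
    even, odd = x[..., 0::2], x[..., 1::2]
    approx = (even + odd) / 2**0.5   # low-frequency band: coarse camera motion
    detail = (even - odd) / 2**0.5   # high-frequency band: fine motion / jitter
    return approx, detail

def wavelet_regularization_loss(pred_traj: torch.Tensor,
                                gt_traj: torch.Tensor,
                                detail_weight: float = 1.0) -> torch.Tensor:
    """Hypothetical frequency-domain regularizer: L1 mismatch of Haar coefficients.

    pred_traj, gt_traj: (batch, dims, T) per-frame camera parameter sequences,
    e.g. dims = 6 for translation (x, y, z) plus rotation (roll, pitch, yaw).
    """
    pred_approx, pred_detail = haar_dwt_1d(pred_traj)
    gt_approx, gt_detail = haar_dwt_1d(gt_traj)
    loss_low = torch.mean(torch.abs(pred_approx - gt_approx))
    loss_high = torch.mean(torch.abs(pred_detail - gt_detail))
    return loss_low + detail_weight * loss_high
```

In training, such a term would typically be added to the base trajectory-estimation loss with a weighting coefficient; multi-level decompositions or other mother wavelets would be equally plausible variants of the idea.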