

CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation

April 10, 2026
Authors: Haoyu Zhao, Zihao Zhang, Jiaxi Gu, Haoran Chen, Qingping Zheng, Pin Tang, Yeyin Jin, Yuang Zhang, Junqi Cheng, Zenghui Lu, Peng Shu, Zuxuan Wu, Yu-Gang Jiang
cs.AI

Abstract

Camera-controllable video generation aims to synthesize videos with flexible and physically plausible camera movements. However, existing methods either provide imprecise camera control from text prompts or rely on labor-intensive manual specification of camera trajectory parameters, limiting their use in automated scenarios. To address these issues, we propose a novel Vision-Language-Camera model, termed CT-1 (Camera Transformer 1), a specialized model designed to transfer spatial reasoning knowledge to video generation by accurately estimating camera trajectories. Built upon vision-language modules and a Diffusion Transformer model, CT-1 employs a Wavelet-based Regularization Loss in the frequency domain to effectively learn complex camera trajectory distributions. These trajectories are integrated into a video diffusion model to enable spatially aware camera control that aligns with user intentions. To facilitate the training of CT-1, we design a dedicated data curation pipeline and construct CT-200K, a large-scale dataset containing over 47M frames. Experimental results demonstrate that our framework successfully bridges the gap between spatial reasoning and video synthesis, yielding faithful, high-quality camera-controllable videos and improving camera control accuracy by 25.7% over prior methods.
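
The abstract does not give the exact formulation of the Wavelet-based Regularization Loss. The sketch below illustrates one plausible reading: decompose predicted and ground-truth camera trajectories with a single-level Haar wavelet transform along the time axis, then penalize the coefficient mismatch so that low-frequency (coarse motion) and high-frequency (fine motion) errors are supervised separately. The names `haar_dwt_1d`, `wavelet_regularization_loss`, and `detail_weight`, as well as the choice of an L1 penalty, are illustrative assumptions rather than the paper's implementation.

```python
import torch

def haar_dwt_1d(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Single-level orthonormal Haar wavelet transform along the last (time) axis.

    x: (..., T) trajectory signal; T is assumed even for simplicity.
    Returns (approximation, detail) coefficients, each of shape (..., T // 2).
    """
    even, odd = x[..., 0::2], x[..., 1::2]
    approx = (even + odd) / 2**0.5   # low-frequency band: coarse camera motion
    detail = (even - odd) / 2**0.5   # high-frequency band: fine motion / jitter
    return approx, detail

def wavelet_regularization_loss(pred_traj: torch.Tensor,
                                gt_traj: torch.Tensor,
                                detail_weight: float = 1.0) -> torch.Tensor:
    """Hypothetical frequency-domain regularizer: L1 mismatch of Haar coefficients.

    pred_traj, gt_traj: (batch, dims, T) per-frame camera parameter sequences,
    e.g. dims = 6 for translation (x, y, z) plus rotation (roll, pitch, yaw).
    """
    pred_approx, pred_detail = haar_dwt_1d(pred_traj)
    gt_approx, gt_detail = haar_dwt_1d(gt_traj)
    loss_low = torch.mean(torch.abs(pred_approx - gt_approx))
    loss_high = torch.mean(torch.abs(pred_detail - gt_detail))
    return loss_low + detail_weight * loss_high
```

In training, such a term would typically be added to the base trajectory-estimation loss with a weighting coefficient; multi-level decompositions or other mother wavelets would be equally plausible variants of the idea.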