

CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation

April 10, 2026
Authors: Haoyu Zhao, Zihao Zhang, Jiaxi Gu, Haoran Chen, Qingping Zheng, Pin Tang, Yeyin Jin, Yuang Zhang, Junqi Cheng, Zenghui Lu, Peng Shu, Zuxuan Wu, Yu-Gang Jiang
cs.AI

Abstract

Camera-controllable video generation aims to synthesize videos with flexible and physically plausible camera movements. However, existing methods either provide imprecise camera control from text prompts or rely on labor-intensive, manually specified camera trajectory parameters, limiting their use in automated scenarios. To address these issues, we propose CT-1 (Camera Transformer 1), a novel Vision-Language-Camera model specialized to transfer spatial reasoning knowledge to video generation by accurately estimating camera trajectories. Built upon vision-language modules and a Diffusion Transformer, CT-1 employs a Wavelet-based Regularization Loss in the frequency domain to effectively learn complex camera trajectory distributions. The estimated trajectories are integrated into a video diffusion model to enable spatially aware camera control that aligns with user intent. To facilitate the training of CT-1, we design a dedicated data curation pipeline and construct CT-200K, a large-scale dataset containing over 47M frames. Experimental results demonstrate that our framework successfully bridges the gap between spatial reasoning and video synthesis, yielding faithful, high-quality camera-controllable videos and improving camera control accuracy by 25.7% over prior methods.
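
The abstract does not spell out the Wavelet-based Regularization Loss, so the following is a minimal, hypothetical PyTorch sketch of the general idea: camera trajectories (shaped batch x frames x pose dimensions) are decomposed with a single-level Haar wavelet transform along the time axis, and predicted and ground-truth wavelet coefficients are matched with an L1 penalty. The tensor shapes, the Haar basis, the L1 matching, and the detail_weight knob are all assumptions for illustration, not CT-1's published formulation.

```python
# Hypothetical sketch of a wavelet-based regularization loss for camera
# trajectories, assumed to be tensors of shape (B, T, D): batch, frames,
# and per-frame pose parameters (e.g., 6-DoF camera pose).
import torch
import torch.nn.functional as F


def haar_dwt_1d(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Single-level Haar DWT along the time axis of a (B, T, D) tensor.

    Returns (approximation, detail) coefficients, each (B, T // 2, D).
    """
    if x.shape[1] % 2:
        x = x[:, :-1]                    # drop the last frame so T is even
    even, odd = x[:, 0::2], x[:, 1::2]
    approx = (even + odd) / 2 ** 0.5     # low-frequency trajectory trend
    detail = (even - odd) / 2 ** 0.5     # high-frequency fluctuation
    return approx, detail


def wavelet_regularization_loss(pred: torch.Tensor,
                                target: torch.Tensor,
                                detail_weight: float = 1.0) -> torch.Tensor:
    """L1 mismatch of Haar wavelet coefficients between two trajectories."""
    pred_a, pred_d = haar_dwt_1d(pred)
    tgt_a, tgt_d = haar_dwt_1d(target)
    return F.l1_loss(pred_a, tgt_a) + detail_weight * F.l1_loss(pred_d, tgt_d)


if __name__ == "__main__":
    # Toy example: batch of 2 trajectories, 48 frames, 6 pose parameters.
    pred = torch.randn(2, 48, 6, requires_grad=True)
    target = torch.randn(2, 48, 6)
    loss = wavelet_regularization_loss(pred, target)
    loss.backward()
    print(f"wavelet loss: {loss.item():.4f}")
```

Splitting the penalty into approximation and detail bands lets the loss weight high-frequency camera jitter differently from the low-frequency motion trend, which is the usual motivation for supervising trajectories in the frequency domain rather than frame by frame.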