CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation

Abstract

Camera-controllable video generation aims to synthesize videos with flexible and physically plausible camera movements. However, existing methods either provide imprecise camera control from text prompts or rely on labor-intensive manual camera trajectory parameters, limiting their use in automated scenarios. To address these issues, we propose a novel Vision-Language-Camera model, termed CT-1 (Camera Transformer 1), a specialized model designed to transfer spatial reasoning knowledge to video generation by accurately estimating camera trajectories. Built upon vision-language modules and a Diffusion Transformer model, CT-1 employs a Wavelet-based Regularization Loss in the frequency domain to effectively learn complex camera trajectory distributions. These trajectories are integrated into a video diffusion model to enable spatially aware camera control that aligns with user intentions. To facilitate the training of CT-1, we design a dedicated data curation pipeline and construct CT-200K, a large-scale dataset containing over 47M frames. Experimental results demonstrate that our framework successfully bridges the gap between spatial reasoning and video synthesis, yielding faithful and high-quality camera-controllable videos and improving camera control accuracy by 25.7% over prior methods.
