VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation

Abstract

Recent image-to-video generation methods have succeeded in controlling one or two visual elements, such as camera trajectory or object motion. However, these methods cannot offer control over multiple visual elements because of limitations in data and network efficacy. In this paper, we introduce VidCRAFT3, a novel framework for precise image-to-video generation that enables simultaneous control over camera motion, object motion, and lighting direction. To better decouple control over each visual element, we propose the Spatial Triple-Attention Transformer, which integrates lighting direction, text, and image in a symmetric way. Since most real-world video datasets lack lighting annotations, we construct a high-quality synthetic video dataset, the VideoLightingDirection (VLD) dataset. This dataset includes lighting direction annotations and objects of diverse appearance, enabling VidCRAFT3 to effectively handle strong light transmission and reflection effects. Additionally, we propose a three-stage training strategy that eliminates the need for training data annotated with all three visual elements (camera motion, object motion, and lighting direction) at once. Extensive experiments on benchmark datasets demonstrate the efficacy of VidCRAFT3 in producing high-quality video content, surpassing existing state-of-the-art methods in control granularity and visual coherence. All code and data will be publicly available. Project page: https://sixiaozheng.github.io/VidCRAFT3/.
