VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation
February 11, 2025
Authors: Sixiao Zheng, Zimian Peng, Yanpeng Zhou, Yi Zhu, Hang Xu, Xiangru Huang, Yanwei Fu
cs.AI
Abstract
Recent image-to-video generation methods have demonstrated success in
enabling control over one or two visual elements, such as camera trajectory or
object motion. However, these methods are unable to offer control over multiple
visual elements due to limitations in data and network efficacy. In this paper,
we introduce VidCRAFT3, a novel framework for precise image-to-video generation
that enables control over camera motion, object motion, and lighting direction
simultaneously. To better decouple control over each visual element, we propose
the Spatial Triple-Attention Transformer, which integrates lighting direction,
text, and image in a symmetric way. Since most real-world video datasets lack
lighting annotations, we construct a high-quality synthetic video dataset, the
VideoLightingDirection (VLD) dataset. This dataset includes lighting direction
annotations and objects of diverse appearance, enabling VidCRAFT3 to
effectively handle strong light transmission and reflection effects.
Additionally, we propose a three-stage training strategy that eliminates the
need for training data annotated with multiple visual elements (camera motion,
object motion, and lighting direction) simultaneously. Extensive experiments on
benchmark datasets demonstrate the efficacy of VidCRAFT3 in producing
high-quality video content, surpassing existing state-of-the-art methods in
terms of control granularity and visual coherence. All code and data will be
publicly available. Project page: https://sixiaozheng.github.io/VidCRAFT3/.
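The abstract says the Spatial Triple-Attention Transformer integrates lighting direction, text, and image "in a symmetric way" but gives no architectural detail. The minimal Python sketch below shows one possible reading: three parallel cross-attention branches, one per conditioning signal, whose outputs are added to the spatial tokens. It is not the authors' implementation. The function names, the omission of learned query/key/value projections, and the residual sum are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Scaled dot-product cross-attention (assumption: no learned projections).

    queries: list of d-dim vectors, e.g. the spatial tokens of one frame
    context: list of d-dim vectors, the tokens of one conditioning signal
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every context token, scaled by sqrt(d).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in context
        ]
        weights = softmax(scores)
        # Weighted average of the context tokens (used as both keys and values).
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, context)) for j in range(dim)]
        )
    return outputs


def triple_attention(spatial_tokens, text_tokens, image_tokens, light_tokens):
    """Hypothetical symmetric fusion of three conditioning signals.

    Each signal gets an identical cross-attention branch; the branch
    outputs are summed and added residually to the spatial tokens.
    """
    branches = [
        cross_attention(spatial_tokens, ctx)
        for ctx in (text_tokens, image_tokens, light_tokens)
    ]
    fused = []
    for i, token in enumerate(spatial_tokens):
        fused.append(
            [token[j] + sum(b[i][j] for b in branches) for j in range(len(token))]
        )
    return fused


# Toy inputs: two spatial tokens and small conditioning sets, all 2-dimensional.
frame = [[1.0, 0.0], [0.0, 1.0]]
text = [[0.5, 0.5]]
image = [[1.0, 0.0], [0.0, 1.0]]
light = [[0.0, 1.0]]
fused = triple_attention(frame, text, image, light)
```

Because every branch has the same form and the outputs are summed, no conditioning signal is privileged in this sketch: reordering the three inputs leaves the result unchanged. A real model would add learned projections, multiple heads, and normalization on each branch.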