Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling
January 29, 2024
Authors: Xiaoyu Shi, Zhaoyang Huang, Fu-Yun Wang, Weikang Bian, Dasong Li, Yi Zhang, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, Jifeng Dai, Hongsheng Li
cs.AI
Abstract
We introduce Motion-I2V, a novel framework for consistent and controllable
image-to-video generation (I2V). In contrast to previous methods that directly
learn the complicated image-to-video mapping, Motion-I2V factorizes I2V into
two stages with explicit motion modeling. For the first stage, we propose a
diffusion-based motion field predictor, which focuses on deducing the
trajectories of the reference image's pixels. For the second stage, we propose
motion-augmented temporal attention to enhance the limited 1-D temporal
attention in video latent diffusion models. This module can effectively
propagate the reference image's features to the synthesized frames under the
guidance of the trajectories predicted in the first stage. Compared with existing methods,
Motion-I2V can generate more consistent videos even in the presence of large
motion and viewpoint variation. By training a sparse trajectory ControlNet for
the first stage, Motion-I2V allows users to precisely control motion
trajectories and motion regions with sparse trajectory and region annotations.
This offers more controllability of the I2V process than solely relying on
textual instructions. Additionally, Motion-I2V's second stage naturally
supports zero-shot video-to-video translation. Both qualitative and
quantitative comparisons demonstrate the advantages of Motion-I2V over prior
approaches in consistent and controllable image-to-video generation.
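To make the second stage concrete, the trajectory-guided propagation can be loosely illustrated as warping the reference image's feature map along a dense per-frame displacement field, as in the minimal nearest-neighbor sketch below. This is an illustrative assumption, not the paper's implementation: the function name and the use of hard nearest-neighbor sampling (rather than learned motion-augmented attention) are hypothetical simplifications.

```python
import numpy as np

def warp_reference_features(ref_feat, flows):
    """Propagate reference-frame features to each target frame by
    sampling along predicted per-pixel trajectories.

    A simplified, hypothetical stand-in for the paper's learned
    motion-augmented temporal attention: here, each target pixel just
    copies the feature of its nearest source location in the reference.

    ref_feat: (H, W, C) feature map of the reference image.
    flows:    (T, H, W, 2) predicted displacement (dx, dy) of each
              pixel at each of T frames.
    Returns:  (T, H, W, C) per-frame propagated features.
    """
    T, H, W, _ = flows.shape
    out = np.zeros((T, H, W, ref_feat.shape[-1]), dtype=ref_feat.dtype)
    ys, xs = np.mgrid[0:H, 0:W]
    for t in range(T):
        # Backward warping with nearest-neighbor rounding; sample
        # positions are clamped to stay inside the reference frame.
        src_x = np.clip(np.round(xs + flows[t, ..., 0]).astype(int), 0, W - 1)
        src_y = np.clip(np.round(ys + flows[t, ..., 1]).astype(int), 0, H - 1)
        out[t] = ref_feat[src_y, src_x]
    return out
```

With a zero displacement field, every frame reproduces the reference features exactly; nonzero flows shift which reference pixel each target pixel draws from, which is the sense in which the first stage's trajectories guide the second stage's synthesis.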