Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling

Abstract

We introduce Motion-I2V, a novel framework for consistent and controllable image-to-video generation (I2V). In contrast to previous methods that directly learn the complicated image-to-video mapping, Motion-I2V factorizes I2V into two stages with explicit motion modeling. For the first stage, we propose a diffusion-based motion field predictor, which focuses on deducing the trajectories of the reference image's pixels. For the second stage, we propose motion-augmented temporal attention to enhance the limited 1-D temporal attention in video latent diffusion models. This module can effectively propagate the reference image's features to synthesized frames, guided by the trajectories predicted in the first stage. Compared with existing methods, Motion-I2V generates more consistent videos even in the presence of large motion and viewpoint variation. By training a sparse trajectory ControlNet for the first stage, Motion-I2V lets users precisely control motion trajectories and motion regions with sparse trajectory and region annotations. This offers more control over the I2V process than relying solely on textual instructions. Additionally, Motion-I2V's second stage naturally supports zero-shot video-to-video translation. Both qualitative and quantitative comparisons demonstrate the advantages of Motion-I2V over prior approaches in consistent and controllable image-to-video generation.
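The idea of propagating reference-image features along predicted trajectories can be illustrated with a small warping sketch. This is not the paper's implementation: the function name `warp_reference_features`, the NumPy array layout, the nearest-neighbour sampling, and the backward-offset convention (for each target pixel, the offset to its source location in the reference image) are all assumptions made for illustration, since the abstract does not specify how the motion field is stored or applied.

```python
import numpy as np


def warp_reference_features(ref_feat, motion_field):
    """Gather reference features along a motion field (hypothetical helper).

    ref_feat:     (H, W, C) feature map of the reference image.
    motion_field: (H, W, 2) offsets (dx, dy); ASSUMED to be in backward form,
                  i.e. target pixel (y, x) reads from (y + dy, x + dx) in the
                  reference. The paper's actual convention may differ.
    Returns an (H, W, C) feature map for the target frame.
    """
    h, w, _ = ref_feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")

    # Nearest-neighbour source coordinates, clamped to the image bounds.
    src_x = np.clip(np.rint(xs + motion_field[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + motion_field[..., 1]).astype(int), 0, h - 1)

    # Advanced indexing gathers one C-dimensional feature per target pixel.
    return ref_feat[src_y, src_x]
```

A zero motion field returns the reference features unchanged, and a constant offset of `dx = -1` shifts the content one pixel to the right, repeating the left border column. In a real model, warped features of this kind would presumably be fed to the temporal attention layers rather than used directly as output.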
