Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling

Abstract

We introduce Motion-I2V, a novel framework for consistent and controllable image-to-video generation (I2V). In contrast to previous methods that directly learn the complicated image-to-video mapping, Motion-I2V factorizes I2V into two stages with explicit motion modeling. For the first stage, we propose a diffusion-based motion field predictor, which focuses on deducing the trajectories of the reference image's pixels. For the second stage, we propose motion-augmented temporal attention to enhance the limited 1-D temporal attention in video latent diffusion models. This module can effectively propagate the reference image's features to the synthesized frames, guided by the trajectories predicted in the first stage. Compared with existing methods, Motion-I2V can generate more consistent videos even in the presence of large motion and viewpoint variation. By training a sparse trajectory ControlNet for the first stage, Motion-I2V lets users precisely control motion trajectories and motion regions with sparse trajectory and region annotations. This offers more controllability over the I2V process than relying solely on textual instructions. Additionally, Motion-I2V's second stage naturally supports zero-shot video-to-video translation. Both qualitative and quantitative comparisons demonstrate the advantages of Motion-I2V over prior approaches in consistent and controllable image-to-video generation.
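The idea of propagating reference-image features along predicted trajectories can be illustrated with a minimal sketch. This is not the paper's motion-augmented temporal attention, only a simplified nearest-neighbor lookup under stated assumptions: the function name `warp_features`, the integer `(dx, dy)` displacement format, and the border clamping are all hypothetical choices made for illustration.

```python
def warp_features(ref, flow):
    """Gather reference features along a per-pixel displacement field.

    ref:  H x W grid (list of rows) of feature values from the reference frame.
    flow: H x W grid of (dx, dy) integer displacements (assumed format); the
          target pixel (x, y) is taken to originate from reference pixel
          (x - dx, y - dy).
    """
    h, w = len(ref), len(ref[0])
    out = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            # Clamp the source location so out-of-frame lookups reuse border values.
            sx = min(max(x - dx, 0), w - 1)
            sy = min(max(y - dy, 0), h - 1)
            out[y][x] = ref[sy][sx]
    return out


ref = [[1, 2, 3],
       [4, 5, 6]]
# Every pixel is assumed to move one step to the right between the two frames.
flow = [[(1, 0)] * 3 for _ in range(2)]
warped = warp_features(ref, flow)
```

In this toy setting each target pixel simply copies the feature it is assumed to have come from. A learned module would instead combine such trajectory-guided lookups with attention over time.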

