Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions

Abstract

Most existing video diffusion models (VDMs) are limited to text-only conditioning. As a result, they usually lack control over the visual appearance and geometric structure of the generated videos. This work presents Moonshot, a new video generation model that conditions simultaneously on multimodal inputs of image and text. The model builds upon a core module, called the multimodal video block (MVB), which consists of conventional spatio-temporal layers for representing video features and a decoupled cross-attention layer that handles image and text inputs for appearance conditioning. In addition, we carefully design the model architecture so that it can optionally integrate with pre-trained image ControlNet modules for geometric visual conditions, without the extra training overhead that prior methods require. Experiments show that, with its versatile multimodal conditioning mechanisms, Moonshot achieves significant improvements in visual quality and temporal consistency compared to existing models. The model can also be easily repurposed for a variety of generative applications, such as personalized video generation, image animation, and video editing, revealing its potential to serve as a fundamental architecture for controllable video generation. Models will be made public at https://github.com/salesforce/LAVIS.
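The abstract names a decoupled cross-attention layer but does not spell out its form. The sketch below illustrates one common reading of the term, not the authors' implementation: the video tokens provide a single shared query, text and image conditions each get their own key/value projections, and the two attention outputs are summed. The class name `DecoupledCrossAttention`, the `image_scale` weight, the single-head layout, the random weights, and all tensor shapes are assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Single-head scaled dot-product attention; q is (n_q, d), k and v are (n_kv, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


class DecoupledCrossAttention:
    """Hypothetical single-head layer: one shared query, separate K/V per modality."""

    def __init__(self, d_model, d_text, d_image, image_scale=1.0, seed=0):
        rng = np.random.default_rng(seed)

        def init(d_in, d_out):
            # Random weights stand in for learned projections.
            return rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)

        self.w_q = init(d_model, d_model)
        self.w_k_text = init(d_text, d_model)
        self.w_v_text = init(d_text, d_model)
        self.w_k_image = init(d_image, d_model)
        self.w_v_image = init(d_image, d_model)
        self.image_scale = image_scale  # assumed knob weighting the image branch

    def __call__(self, video_tokens, text_tokens, image_tokens):
        q = video_tokens @ self.w_q
        text_out = attention(
            q, text_tokens @ self.w_k_text, text_tokens @ self.w_v_text
        )
        image_out = attention(
            q, image_tokens @ self.w_k_image, image_tokens @ self.w_v_image
        )
        # "Decoupled": each modality is attended to separately, then the results are added.
        return text_out + self.image_scale * image_out


layer = DecoupledCrossAttention(d_model=8, d_text=6, d_image=4)
rng = np.random.default_rng(1)
out = layer(
    rng.standard_normal((5, 8)),  # 5 video tokens
    rng.standard_normal((3, 6)),  # 3 text tokens
    rng.standard_normal((2, 4)),  # 2 image tokens
)
print(out.shape)  # (5, 8)
```

In this formulation, setting `image_scale` to 0 reduces the layer to ordinary text-only cross-attention, so under these assumptions the image branch acts as an additive extension of a text-conditioned model.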
