Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions

Abstract

Most existing video diffusion models (VDMs) are limited to text-only conditioning. As a result, they usually offer little control over the visual appearance and geometric structure of the generated videos. This work presents Moonshot, a new video generation model that conditions simultaneously on multimodal inputs of image and text. The model builds upon a core module, called the multimodal video block (MVB), which consists of conventional spatial-temporal layers for representing video features and a decoupled cross-attention layer that handles image and text inputs for appearance conditioning. In addition, we carefully design the model architecture so that it can optionally integrate with pre-trained image ControlNet modules for geometric visual conditions, without requiring the extra training overhead of prior methods. Experiments show that, with its versatile multimodal conditioning mechanisms, Moonshot achieves significant improvements in visual quality and temporal consistency compared to existing models. The model can also be easily repurposed for a variety of generative applications, such as personalized video generation, image animation, and video editing, which suggests its potential to serve as a fundamental architecture for controllable video generation. Models will be made public at https://github.com/salesforce/LAVIS.
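
The abstract does not spell out how the decoupled cross-attention layer combines its two inputs. As a rough illustration only, the sketch below assumes one common formulation: the video features attend to the text tokens and to the image tokens in two separate attention passes, and the two results are summed. The function names, the `image_scale` weight, and the omission of learned query/key/value projections are all assumptions made for this example, not details taken from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over plain lists of vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out


def decoupled_cross_attention(queries, text_tokens, image_tokens,
                              image_scale=1.0):
    """Hypothetical sketch: attend to text and image separately, then sum.

    Assumption: learned projections are omitted, so each token serves as
    both key and value. `image_scale` is an illustrative weight on the
    image branch, not a parameter named in the source.
    """
    text_out = attend(queries, text_tokens, text_tokens)
    image_out = attend(queries, image_tokens, image_tokens)
    return [[t + image_scale * i for t, i in zip(t_row, i_row)]
            for t_row, i_row in zip(text_out, image_out)]
```

In this sketch, setting `image_scale` to 0 reduces the layer to ordinary text cross-attention, which is one reason a separate image branch can be attached without disturbing the text pathway.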

