Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions

Abstract

Most existing video diffusion models (VDMs) are conditioned on text alone. As a result, they often lack control over the visual appearance and geometric structure of the generated videos. This work presents Moonshot, a new video generation model that conditions simultaneously on multimodal inputs of image and text. The model builds upon a core module, called the multimodal video block (MVB), which consists of conventional spatiotemporal layers for representing video features and a decoupled cross-attention layer that handles image and text inputs for appearance conditioning. In addition, we carefully design the model architecture so that it can optionally integrate with pre-trained image ControlNet modules for geometric visual conditions, without the extra training overhead that prior methods require. Experiments show that, with versatile multimodal conditioning mechanisms, Moonshot demonstrates significant improvement in visual quality and temporal consistency compared to existing models. In addition, the model can be easily repurposed for a variety of generative applications, such as personalized video generation, image animation, and video editing, showing its potential to serve as a fundamental architecture for controllable video generation. Models will be made public at https://github.com/salesforce/LAVIS.
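To make the idea of a decoupled cross-attention layer concrete, the following is a minimal sketch in plain Python. It assumes that the video queries attend to text tokens and image tokens through two separate attention paths whose outputs are then summed; the abstract does not state how the two paths are combined, so the summation, the `image_scale` weight, and all function names here are illustrative assumptions rather than the authors' implementation. Learned key/value projections are omitted for brevity.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    d = len(q[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)

def decoupled_cross_attention(q, text_kv, image_kv, image_scale=1.0):
    """Attend to text and image tokens separately, then combine.

    ASSUMPTION: the two outputs are summed, with a hypothetical
    `image_scale` weight on the image path. The source abstract
    does not specify the combination rule.
    """
    text_out = attention(q, *text_kv)    # text-conditioned path
    image_out = attention(q, *image_kv)  # image-conditioned path
    return [
        [t + image_scale * i for t, i in zip(t_row, i_row)]
        for t_row, i_row in zip(text_out, image_out)
    ]

if __name__ == "__main__":
    q = [[1.0, 0.0]]                     # one video query token
    text_kv = ([[0.5, 0.5]], [[1.0, 2.0]])     # (keys, values) from text
    image_kv = ([[0.1, 0.2]], [[10.0, 20.0]])  # (keys, values) from image
    print(decoupled_cross_attention(q, text_kv, image_kv, image_scale=0.5))
```

Keeping the image path separate means that setting `image_scale` to zero in this sketch reduces the layer to ordinary text-only cross-attention, which illustrates why such a design can leave the text-conditioned pathway unchanged.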
