CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects

Abstract

Customized text-to-video generation aims to produce high-quality videos guided by text prompts and subject references. Existing approaches are designed for a single subject and struggle with multiple subjects, a more challenging and more practical scenario. In this work, we aim to advance multi-subject guided text-to-video customization. We propose CustomVideo, a novel framework that generates identity-preserving videos under the guidance of multiple subjects. Specifically, we first encourage the co-occurrence of multiple subjects by composing them in a single image. Then, building on a basic text-to-video diffusion model, we design a simple yet effective attention control strategy to disentangle different subjects in the latent space of the diffusion model. Moreover, to help the model focus on the specific object area, we segment the object from the given reference images and provide the corresponding object mask for attention learning. We also collect a multi-subject text-to-video generation dataset as a comprehensive benchmark, with 69 individual subjects and 57 meaningful pairs. Extensive qualitative, quantitative, and user study results demonstrate the superiority of our method over previous state-of-the-art approaches.
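
The abstract does not say how the subjects are composed into one image or what form the attention objective takes. The following is only a minimal sketch of the general idea, assuming horizontal concatenation of two equal-height images and a mean-squared-error penalty between a subject token's attention map and that subject's mask. The function names `compose_side_by_side` and `mask_attention_loss`, the nested-list image type, and the loss form are all hypothetical and are not taken from the paper.

```python
from typing import List

# Assumption: an image, mask, or attention map is a small H x W grid of floats.
# A real pipeline would use tensors from a deep learning framework instead.
Grid = List[List[float]]


def compose_side_by_side(left: Grid, right: Grid) -> Grid:
    """Concatenate two equal-height grids horizontally so both subjects share one frame."""
    if len(left) != len(right):
        raise ValueError("grids must have the same height")
    return [l_row + r_row for l_row, r_row in zip(left, right)]


def mask_attention_loss(attn: Grid, mask: Grid) -> float:
    """Mean squared error between an attention map and a binary subject mask.

    Hypothetical objective: it is lowest when the attention for a subject's
    token stays inside that subject's masked region.
    """
    total, count = 0.0, 0
    for a_row, m_row in zip(attn, mask):
        for a, m in zip(a_row, m_row):
            total += (a - m) ** 2
            count += 1
    return total / count


# Two toy 2x2 regions: one fully covered by a subject, one empty.
ones = [[1.0, 1.0], [1.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

# After composition, subject A occupies the left half and subject B the right half.
composed_mask_a = compose_side_by_side(ones, zeros)
composed_mask_b = compose_side_by_side(zeros, ones)

# A made-up attention map for subject A's token that leaks into subject B's half.
leaky_attn = [[1.0, 1.0, 0.5, 0.5], [1.0, 1.0, 0.5, 0.5]]
print(mask_attention_loss(leaky_attn, composed_mask_a))  # 0.125
```

In this toy setup the loss is zero only when the attention map matches the subject's mask exactly, which is one simple way to express "disentangling" two subjects that share a single composed image.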
