Magic-Me: Identity-Specific Video Customized Diffusion

Abstract

Creating content for a specific identity (ID) has attracted significant interest in the field of generative models. In text-to-image (T2I) generation, subject-driven content generation has made great progress, with the ID in the images controllable. However, extending it to video generation is not well explored. In this work, we propose a simple yet effective subject-identity-controllable video generation framework, termed Video Custom Diffusion (VCD). Given a subject ID defined by a few images, VCD reinforces identity information extraction and injects frame-wise correlation at the initialization stage, producing stable video outputs with identity preserved to a large extent. To achieve this, we propose three novel components that are essential for high-quality ID preservation: 1) an ID module trained on the identity cropped by prompt-to-segmentation, which disentangles the ID information from the background noise for more accurate ID token learning; 2) a text-to-video (T2V) VCD module with a 3D Gaussian Noise Prior for better inter-frame consistency; and 3) video-to-video (V2V) Face VCD and Tiled VCD modules that deblur the face and upscale the video to a higher resolution. Despite VCD's simplicity, our extensive experiments verify that it generates stable, high-quality videos with better ID preservation than the selected strong baselines. In addition, because the ID module is transferable, VCD also works well with publicly available finetuned text-to-image models, further improving its usability. The code is available at https://github.com/Zhen-Dong/Magic-Me.
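
The abstract says only that VCD injects frame-wise correlation at initialization through a 3D Gaussian Noise Prior; it does not give the construction. As a rough illustration of the general idea, the sketch below samples initial noise in which neighboring frames are correlated instead of independent. The function names, the `gamma` parameter, and the AR(1) correlation form (correlation `gamma ** |i - j|` between frames `i` and `j`) are assumptions made for this example, not the authors' implementation. It uses only the standard library to stay self-contained; a real pipeline would most likely use a tensor library.

```python
import math
import random


def correlated_frame_noise(num_frames, frame_size, gamma, seed=None):
    """Sample per-frame Gaussian noise that is correlated across frames.

    Hypothetical helper (not from the paper): an AR(1) recursion keeps
    every entry at unit variance while the correlation between frame i
    and frame j is gamma ** |i - j|.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1)")
    rng = random.Random(seed)
    scale = math.sqrt(1.0 - gamma * gamma)  # keeps marginal variance at 1

    # First frame: plain i.i.d. standard normal noise.
    frames = [[rng.gauss(0.0, 1.0) for _ in range(frame_size)]]
    for _ in range(1, num_frames):
        prev = frames[-1]
        # Mix the previous frame's noise with fresh independent noise.
        frames.append([gamma * p + scale * rng.gauss(0.0, 1.0) for p in prev])
    return frames


def sample_correlation(a, b):
    """Pearson correlation between two equally sized lists of floats."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)
```

With `gamma = 0` the frames are independent, which corresponds to ordinary per-frame Gaussian initialization; larger values of `gamma` tie adjacent frames more closely together. Calling `sample_correlation` on two adjacent frames of a large sample should give a value close to `gamma`.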
