VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis

Abstract

We propose VLOGGER, a method for audio-driven human video generation from a single input image of a person, which builds on the success of recent generative diffusion models. Our method consists of 1) a stochastic human-to-3D-motion diffusion model, and 2) a novel diffusion-based architecture that augments text-to-image models with both spatial and temporal controls. This supports the generation of high-quality video of variable length, easily controllable through high-level representations of human faces and bodies. In contrast to previous work, our method does not require training for each person, does not rely on face detection and cropping, generates the complete image (not just the face or the lips), and considers a broad spectrum of scenarios (e.g., visible torso or diverse subject identities) that are critical to correctly synthesizing humans who communicate. We also curate MENTOR, a new and diverse dataset with 3D pose and expression annotations, one order of magnitude larger than previous ones (800,000 identities) and with dynamic gestures, on which we train and ablate our main technical contributions. VLOGGER outperforms state-of-the-art methods on three public benchmarks in terms of image quality, identity preservation, and temporal consistency, while also generating upper-body gestures. We analyze the performance of VLOGGER with respect to multiple diversity metrics, showing that our architectural choices and the use of MENTOR benefit training a fair and unbiased model at scale. Finally, we show applications in video editing and personalization.

