VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis

Abstract

We propose VLOGGER, a method for audio-driven human video generation from a single input image of a person, which builds on the success of recent generative diffusion models. Our method consists of 1) a stochastic human-to-3D-motion diffusion model, and 2) a novel diffusion-based architecture that augments text-to-image models with both spatial and temporal controls. This supports the generation of high-quality video of variable length, easily controllable through high-level representations of human faces and bodies. In contrast to previous work, our method does not require training for each person, does not rely on face detection and cropping, generates the complete image (not just the face or the lips), and considers a broad spectrum of scenarios (e.g., visible torso or diverse subject identities) that are critical to correctly synthesizing humans who communicate. We also curate MENTOR, a new and diverse dataset with 3D pose and expression annotations, one order of magnitude larger than previous ones (800,000 identities) and with dynamic gestures, on which we train and ablate our main technical contributions. VLOGGER outperforms state-of-the-art methods on three public benchmarks, considering image quality, identity preservation, and temporal consistency, while also generating upper-body gestures. We analyze the performance of VLOGGER with respect to multiple diversity metrics, showing that our architectural choices and the use of MENTOR benefit training a fair and unbiased model at scale. Finally, we show applications in video editing and personalization.
