VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis

Abstract

We propose VLOGGER, a method for audio-driven human video generation from a single input image of a person, which builds on the success of recent generative diffusion models. Our method consists of 1) a stochastic human-to-3D-motion diffusion model, and 2) a novel diffusion-based architecture that augments text-to-image models with both spatial and temporal controls. This supports the generation of high-quality video of variable length that is easily controllable through high-level representations of human faces and bodies. In contrast to previous work, our method does not require training for each person, does not rely on face detection and cropping, generates the complete image (not just the face or the lips), and considers a broad spectrum of scenarios (e.g. visible torso or diverse subject identities) that are critical to correctly synthesizing humans who communicate. We also curate MENTOR, a new and diverse dataset with 3D pose and expression annotations, one order of magnitude larger than previous ones (800,000 identities) and with dynamic gestures, on which we train and ablate our main technical contributions. VLOGGER outperforms state-of-the-art methods on three public benchmarks in image quality, identity preservation, and temporal consistency, while also generating upper-body gestures. We analyze the performance of VLOGGER with respect to multiple diversity metrics, showing that our architectural choices and the use of MENTOR benefit training a fair and unbiased model at scale. Finally, we show applications in video editing and personalization.
