VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis

Abstract

We propose VLOGGER, a method for audio-driven human video generation from a single input image of a person, which builds on the success of recent generative diffusion models. Our method consists of 1) a stochastic human-to-3D-motion diffusion model, and 2) a novel diffusion-based architecture that augments text-to-image models with both spatial and temporal controls. This supports the generation of high-quality video of variable length, easily controllable through high-level representations of human faces and bodies. In contrast to previous work, our method does not require training for each person, does not rely on face detection and cropping, generates the complete image (not just the face or the lips), and considers a broad spectrum of scenarios (e.g., visible torso or diverse subject identities) that are critical to correctly synthesize humans who communicate. We also curate MENTOR, a new and diverse dataset with 3D pose and expression annotations, one order of magnitude larger than previous ones (800,000 identities) and with dynamic gestures, on which we train and ablate our main technical contributions. VLOGGER outperforms state-of-the-art methods on three public benchmarks, considering image quality, identity preservation, and temporal consistency, while also generating upper-body gestures. We analyze the performance of VLOGGER with respect to multiple diversity metrics, showing that our architectural choices and the use of MENTOR benefit training a fair and unbiased model at scale. Finally, we show applications in video editing and personalization.
