From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations

Abstract

We present a framework for generating full-body photorealistic avatars that gesture according to the conversational dynamics of a dyadic interaction. Given speech audio, we output multiple possible gestural motions for an individual, covering the face, body, and hands. The key to our method is combining the sample diversity of vector quantization (VQ) with the high-frequency detail obtained through diffusion to generate more dynamic, expressive motion. We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures (e.g., sneers and smirks). To facilitate this line of research, we introduce a first-of-its-kind multi-view conversational dataset that allows for photorealistic reconstruction. Experiments show that our model generates appropriate and diverse gestures, outperforming both diffusion-only and VQ-only methods. Furthermore, our perceptual evaluation highlights the importance of photorealism (versus meshes) in accurately assessing subtle motion details in conversational gestures. Code and dataset are available online.
