From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
January 3, 2024
Authors: Evonne Ng, Javier Romero, Timur Bagautdinov, Shaojie Bai, Trevor Darrell, Angjoo Kanazawa, Alexander Richard
cs.AI
Abstract
We present a framework for generating full-bodied photorealistic avatars that
gesture according to the conversational dynamics of a dyadic interaction. Given
speech audio, we output multiple possibilities of gestural motion for an
individual, including face, body, and hands. The key to our method is
combining the benefits of sample diversity from vector quantization with the
high-frequency details obtained through diffusion to generate more dynamic,
expressive motion. We visualize the generated motion using highly
photorealistic avatars that can express crucial nuances in gestures (e.g.
sneers and smirks). To facilitate this line of research, we introduce a
first-of-its-kind multi-view conversational dataset that allows for
photorealistic reconstruction. Experiments show our model generates appropriate
and diverse gestures, outperforming both diffusion- and VQ-only methods.
Furthermore, our perceptual evaluation highlights the importance of
photorealism (vs. meshes) in accurately assessing subtle motion details in
conversational gestures. Code and dataset available online.
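The abstract's central idea is a two-stage generator: a vector-quantized codebook supplies diverse coarse "guide" poses from speech audio, and a diffusion model conditioned on the audio and those guides adds high-frequency motion detail. The following is a minimal sketch of that idea, not the authors' code; the module names, feature dimensions, cosine noise schedule, and training step are all illustrative assumptions.

```python
# Illustrative two-stage audio-to-motion sketch: VQ guide poses + diffusion refinement.
# All dimensions and architecture choices below are assumptions for demonstration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

AUDIO_DIM, POSE_DIM, HID, CODES = 80, 104, 256, 512  # assumed feature sizes


class VQGuidePredictor(nn.Module):
    """Stage 1: predicts coarse guide poses by snapping audio-driven features to a codebook."""

    def __init__(self):
        super().__init__()
        self.audio_enc = nn.GRU(AUDIO_DIM, HID, batch_first=True)
        self.codebook = nn.Embedding(CODES, HID)   # discrete latent codebook
        self.to_pose = nn.Linear(HID, POSE_DIM)

    def forward(self, audio):                       # audio: (B, T, AUDIO_DIM)
        h, _ = self.audio_enc(audio)                # (B, T, HID)
        # nearest-neighbour vector quantization against the codebook
        d = torch.cdist(h, self.codebook.weight[None].expand(h.size(0), -1, -1))
        z_q = self.codebook(d.argmin(-1))           # quantized latents (B, T, HID)
        return self.to_pose(z_q)                    # coarse guide poses (B, T, POSE_DIM)


class DiffusionRefiner(nn.Module):
    """Stage 2: predicts the noise added to fine poses, conditioned on audio + guide poses."""

    def __init__(self):
        super().__init__()
        self.in_proj = nn.Linear(POSE_DIM * 2 + AUDIO_DIM + 1, HID)
        layer = nn.TransformerEncoderLayer(HID, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(HID, POSE_DIM)

    def forward(self, noisy_pose, t, audio, guide):
        t_emb = t.float().view(-1, 1, 1).expand(-1, noisy_pose.size(1), 1)
        x = torch.cat([noisy_pose, guide, audio, t_emb], dim=-1)
        return self.out(self.backbone(self.in_proj(x)))  # predicted noise


def train_step(vq, refiner, audio, pose, num_steps=1000):
    """One DDPM-style noise-prediction step using the VQ guides as conditioning."""
    guide = vq(audio).detach()                       # diverse coarse guides from stage 1
    t = torch.randint(0, num_steps, (pose.size(0),))
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / num_steps).view(-1, 1, 1) ** 2
    noise = torch.randn_like(pose)
    noisy = alpha_bar.sqrt() * pose + (1 - alpha_bar).sqrt() * noise
    eps_hat = refiner(noisy, t, audio, guide)        # stage 2 fills in fine detail
    return F.mse_loss(eps_hat, noise)


if __name__ == "__main__":
    audio = torch.randn(2, 60, AUDIO_DIM)            # 2 clips, 60 frames of audio features
    pose = torch.randn(2, 60, POSE_DIM)              # matching face/body/hand pose vectors
    print(train_step(VQGuidePredictor(), DiffusionRefiner(), audio, pose).item())
```

In this sketch the codebook lookup gives the sample diversity attributed to vector quantization, while the noise-prediction objective gives the diffusion model room to add the high-frequency detail the abstract describes; the actual paper's networks, pose representation, and schedule will differ.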