From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations

January 3, 2024
Authors: Evonne Ng, Javier Romero, Timur Bagautdinov, Shaojie Bai, Trevor Darrell, Angjoo Kanazawa, Alexander Richard
cs.AI

Abstract

We present a framework for generating full-bodied photorealistic avatars that gesture according to the conversational dynamics of a dyadic interaction. Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands. The key to our method lies in combining the benefits of sample diversity from vector quantization with the high-frequency details obtained through diffusion to generate more dynamic, expressive motion. We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures (e.g., sneers and smirks). To facilitate this line of research, we introduce a first-of-its-kind multi-view conversational dataset that allows for photorealistic reconstruction. Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods. Furthermore, our perceptual evaluation highlights the importance of photorealism (vs. meshes) in accurately assessing subtle motion details in conversational gestures. Code and dataset are available online.
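The core idea described above, using vector quantization to propose diverse coarse motion and diffusion to fill in high-frequency detail, can be sketched as a two-stage pipeline. The sketch below is not the authors' implementation: the module names, pose and audio dimensions, codebook size, noise schedule, and the plain DDPM sampler are all illustrative assumptions.

```python
# Minimal sketch of the two-stage idea: a VQ sampler proposes diverse coarse
# guide poses from audio (stage 1), and a diffusion model refines them with
# high-frequency motion detail (stage 2). All sizes are assumed, not the paper's.
import torch
import torch.nn as nn

POSE_DIM, AUDIO_DIM, CODEBOOK = 104, 128, 512  # assumed dimensions

class CoarseVQSampler(nn.Module):
    """Samples codebook indices from audio features, then decodes them
    into coarse guide poses; sampling gives motion diversity."""
    def __init__(self):
        super().__init__()
        self.codebook = nn.Embedding(CODEBOOK, 64)
        self.prior = nn.GRU(AUDIO_DIM, 64, batch_first=True)
        self.to_logits = nn.Linear(64, CODEBOOK)
        self.decode = nn.Linear(64, POSE_DIM)

    def forward(self, audio):                     # audio: (B, T, AUDIO_DIM)
        h, _ = self.prior(audio)                  # (B, T, 64)
        logits = self.to_logits(h)                # (B, T, CODEBOOK)
        idx = torch.distributions.Categorical(logits=logits).sample()
        return self.decode(self.codebook(idx))    # coarse poses (B, T, POSE_DIM)

class DetailDiffusion(nn.Module):
    """Predicts the noise added to fine motion, conditioned on audio,
    the coarse guide poses, and the diffusion timestep."""
    def __init__(self, steps=100):
        super().__init__()
        self.steps = steps
        betas = torch.linspace(1e-4, 0.02, steps)  # assumed linear schedule
        self.register_buffer("betas", betas)
        self.register_buffer("alpha_bar", torch.cumprod(1 - betas, dim=0))
        self.net = nn.Sequential(
            nn.Linear(POSE_DIM * 2 + AUDIO_DIM + 1, 256), nn.SiLU(),
            nn.Linear(256, POSE_DIM))

    def eps(self, x_t, t, audio, guide):
        # Broadcast the (normalized) timestep over the sequence dimension.
        t_emb = t.float().view(-1, 1, 1).expand(*x_t.shape[:2], 1) / self.steps
        return self.net(torch.cat([x_t, guide, audio, t_emb], dim=-1))

    @torch.no_grad()
    def sample(self, audio, guide):
        x = torch.randn(*guide.shape)             # start from pure noise
        for t in reversed(range(self.steps)):     # standard DDPM reverse loop
            tb = torch.full((x.shape[0],), t, dtype=torch.long)
            beta, ab = self.betas[t], self.alpha_bar[t]
            eps = self.eps(x, tb, audio, guide)
            x = (x - beta / (1 - ab).sqrt() * eps) / (1 - beta).sqrt()
            if t > 0:
                x = x + beta.sqrt() * torch.randn_like(x)
        return x                                  # refined motion (B, T, POSE_DIM)

# Usage: draw several distinct motions for the same speech input.
audio = torch.randn(1, 240, AUDIO_DIM)            # a few seconds of audio features
vq, diff = CoarseVQSampler(), DetailDiffusion()
motions = [diff.sample(audio, vq(audio)) for _ in range(3)]
```

The split mirrors the trade-off the abstract names: the categorical VQ sampler supplies diverse, plausible coarse trajectories, while the diffusion stage, conditioned on those trajectories, restores the high-frequency detail that VQ decoding alone tends to smooth away.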