DyaDiT: A Multi-Modal Diffusion Transformer for Socially Favorable Dyadic Gesture Generation

February 26, 2026
Authors: Yichen Peng, Jyun-Ting Song, Siyeol Jung, Ruofan Liu, Haiyang Liu, Xuangeng Chu, Ruicong Liu, Erwin Wu, Hideki Koike, Kris Kitani
cs.AI

Abstract

Generating realistic conversational gestures is essential for achieving natural, socially engaging interactions with digital humans. However, existing methods typically map a single audio stream to a single speaker's motion, without considering social context or modeling the mutual dynamics between two people engaged in conversation. We present DyaDiT, a multi-modal diffusion transformer that generates contextually appropriate human motion from dyadic audio signals. Trained on the Seamless Interaction Dataset, DyaDiT takes dyadic audio together with optional social-context tokens to produce context-appropriate motion. It fuses information from both speakers to capture interaction dynamics, uses a motion dictionary to encode motion priors, and can optionally utilize the conversational partner's gestures to produce more responsive motion. We evaluate DyaDiT on standard motion generation metrics and conduct quantitative user studies, demonstrating that it not only surpasses existing methods on objective metrics but is also strongly preferred by users, highlighting its robustness in socially favorable motion generation. Code and models will be released upon acceptance.
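The abstract describes the conditioning scheme only at a high level, and the code has not yet been released. The sketch below is one plausible PyTorch reading of that scheme: noisy motion tokens self-attend, then cross-attend to a fused conditioning sequence built from both speakers' audio features, a learned motion-dictionary prior, optional social-context tokens, and optional partner motion. All module names, dimensions, and the diffusion parameterization are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a dyadic diffusion-transformer denoiser
# (illustrative only; not the DyaDiT codebase).
import torch
import torch.nn as nn


class DyadicFusionBlock(nn.Module):
    """Self-attention over motion tokens + cross-attention to dyadic context."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, cond, cond, need_weights=False)[0]
        return x + self.mlp(self.norm3(x))


class DyadicDenoiser(nn.Module):
    def __init__(self, motion_dim=135, audio_dim=768, dim=512,
                 depth=4, dict_size=256):
        super().__init__()
        self.motion_in = nn.Linear(motion_dim, dim)
        self.audio_in = nn.Linear(audio_dim, dim)
        # Learned "motion dictionary": a bank of prior tokens the denoiser
        # attends to -- one plausible reading of the motion-prior encoding.
        self.motion_dict = nn.Parameter(0.02 * torch.randn(dict_size, dim))
        self.time_emb = nn.Sequential(
            nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim)
        )
        self.blocks = nn.ModuleList(
            DyadicFusionBlock(dim) for _ in range(depth)
        )
        self.motion_out = nn.Linear(dim, motion_dim)

    def forward(self, noisy_motion, t, audio_self, audio_partner,
                context_tokens=None, partner_motion=None):
        b = noisy_motion.shape[0]
        x = self.motion_in(noisy_motion) + self.time_emb(t.view(b, 1, 1).float())
        cond = [
            self.audio_in(audio_self),                         # own audio
            self.audio_in(audio_partner),                      # partner audio
            self.motion_dict.unsqueeze(0).expand(b, -1, -1),   # motion priors
        ]
        if context_tokens is not None:    # optional social-context tokens (b, n, dim)
            cond.append(context_tokens)
        if partner_motion is not None:    # optional partner gestures for responsiveness
            cond.append(self.motion_in(partner_motion))
        cond = torch.cat(cond, dim=1)
        for blk in self.blocks:
            x = blk(x, cond)
        # Whether this predicts noise or clean motion depends on the diffusion
        # parameterization, which the abstract does not specify.
        return self.motion_out(x)


if __name__ == "__main__":
    model = DyadicDenoiser()
    x_t = torch.randn(2, 120, 135)        # noisy motion, 120 frames
    t = torch.randint(0, 1000, (2,))      # diffusion timesteps
    a1 = torch.randn(2, 120, 768)         # own audio features
    a2 = torch.randn(2, 120, 768)         # partner audio features
    print(model(x_t, t, a1, a2).shape)    # torch.Size([2, 120, 135])
```

Concatenating all conditioning streams into a single cross-attention context keeps the optional inputs (social-context tokens, partner motion) easy to drop at inference, which matches the abstract's claim that they are optional; the actual fusion mechanism in DyaDiT may differ.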