DyaDiT: A Multi-Modal Diffusion Transformer for Socially Favorable Dyadic Gesture Generation

Abstract

Generating realistic conversational gestures is essential for achieving natural, socially engaging interactions with digital humans. However, existing methods typically map a single audio stream to a single speaker's motion, without considering social context or modeling the mutual dynamics between two people engaged in conversation. We present DyaDiT, a multi-modal diffusion transformer that generates contextually appropriate human motion from dyadic audio signals. Trained on the Seamless Interaction Dataset, DyaDiT takes dyadic audio, together with optional social-context tokens, as input. It fuses information from both speakers to capture interaction dynamics, uses a motion dictionary to encode motion priors, and can optionally use the conversational partner's gestures to produce more responsive motion. We evaluate DyaDiT on standard motion generation metrics and conduct quantitative user studies, demonstrating that it not only surpasses existing methods on objective metrics but is also strongly preferred by users, highlighting its robustness and its ability to generate socially favorable motion. Code and models will be released upon acceptance.
