ReactMotion: Generating Reactive Listener Motions from Speaker Utterance

Abstract

In this paper, we introduce a new task, Reactive Listener Motion Generation from Speaker Utterance, which aims to generate naturalistic listener body motions that appropriately respond to a speaker's utterance. Modeling such nonverbal listener behaviors remains underexplored and challenging because human reactions are inherently non-deterministic. To facilitate this task, we present ReactMotionNet, a large-scale dataset that pairs speaker utterances with multiple candidate listener motions annotated with varying degrees of appropriateness. This dataset design explicitly captures the one-to-many nature of listener behavior and provides supervision beyond a single ground-truth motion. Building on this dataset design, we develop preference-oriented evaluation protocols tailored to reactive appropriateness, which conventional motion metrics focused on input-motion alignment ignore. We further propose ReactMotion, a unified generative framework that jointly models text, audio, emotion, and motion, and is trained with preference-based objectives to encourage both appropriate and diverse listener responses. Extensive experiments show that ReactMotion outperforms retrieval baselines and cascaded LLM-based pipelines, generating more natural, diverse, and appropriate listener motions.
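To make the one-to-many supervision concrete, the following is a minimal sketch of how candidates with graded appropriateness could be turned into preference pairs, scored with a pairwise loss, and evaluated with a pairwise win rate. It is an illustration under stated assumptions, not the paper's actual objective or protocol: the `Candidate` record, the integer appropriateness labels, the Bradley-Terry style loss, and all function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    motion_id: str        # hypothetical identifier for a listener motion clip
    appropriateness: int  # hypothetical ordinal label; higher = more appropriate


def preference_pairs(candidates):
    """Return (preferred, rejected) pairs for candidates with different labels."""
    return [(a, b) for a in candidates for b in candidates
            if a.appropriateness > b.appropriateness]


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_loss(scores, candidates):
    """Mean of -log sigmoid(s_preferred - s_rejected) over all preference pairs."""
    pairs = preference_pairs(candidates)
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        margin = scores[a.motion_id] - scores[b.motion_id]
        total += softplus(-margin)
    return total / len(pairs)


def win_rate(scores, candidates):
    """Fraction of preference pairs where the preferred motion scores higher."""
    pairs = preference_pairs(candidates)
    if not pairs:
        return 0.0
    wins = sum(scores[a.motion_id] > scores[b.motion_id] for a, b in pairs)
    return wins / len(pairs)


# One speaker utterance with three candidate listener motions (made-up labels).
candidates = [Candidate("nod", 2), Candidate("shrug", 1), Candidate("wave", 0)]
aligned = {"nod": 2.0, "shrug": 1.0, "wave": 0.0}    # agrees with the labels
reversed_ = {"nod": 0.0, "shrug": 1.0, "wave": 2.0}  # disagrees with the labels
```

Here `scores` stands in for whatever scalar a model assigns to each candidate motion given the utterance. A scorer that orders candidates consistently with the annotations yields a lower loss and a higher win rate than one that reverses them.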
