ReactMotion: Generating Reactive Listener Motions from Speaker Utterance
March 16, 2026
Authors: Cheng Luo, Bizhu Wu, Bing Li, Jianfeng Ren, Ruibin Bai, Rong Qu, Linlin Shen, Bernard Ghanem
cs.AI
Abstract
In this paper, we introduce a new task, Reactive Listener Motion Generation from Speaker Utterance, which aims to generate naturalistic listener body motions that appropriately respond to a speaker's utterance. Modeling such nonverbal listener behaviors remains underexplored and challenging due to the inherently non-deterministic nature of human reactions. To facilitate this task, we present ReactMotionNet, a large-scale dataset that pairs speaker utterances with multiple candidate listener motions annotated with varying degrees of appropriateness. This dataset design explicitly captures the one-to-many nature of listener behavior and provides supervision beyond a single ground-truth motion. Building on this design, we develop preference-oriented evaluation protocols tailored to assess reactive appropriateness, a dimension that conventional motion metrics, which focus on input-motion alignment, overlook. We further propose ReactMotion, a unified generative framework that jointly models text, audio, emotion, and motion, and is trained with preference-based objectives to encourage both appropriate and diverse listener responses. Extensive experiments show that ReactMotion outperforms retrieval baselines and cascaded LLM-based pipelines, generating more natural, diverse, and appropriate listener motions.