ReactMotion: Generating Reactive Listener Motions from Speaker Utterance

March 16, 2026
作者: Cheng Luo, Bizhu Wu, Bing Li, Jianfeng Ren, Ruibin Bai, Rong Qu, Linlin Shen, Bernard Ghanem
cs.AI

Abstract

In this paper, we introduce a new task, Reactive Listener Motion Generation from Speaker Utterance, which aims to generate naturalistic listener body motions that appropriately respond to a speaker's utterance. Modeling such nonverbal listener behaviors remains underexplored and challenging due to the inherently non-deterministic nature of human reactions. To facilitate this task, we present ReactMotionNet, a large-scale dataset that pairs speaker utterances with multiple candidate listener motions annotated with varying degrees of appropriateness. This dataset design explicitly captures the one-to-many nature of listener behavior and provides supervision beyond a single ground-truth motion. Building on this design, we develop preference-oriented evaluation protocols tailored to reactive appropriateness, a dimension that conventional motion metrics focused on input-motion alignment ignore. We further propose ReactMotion, a unified generative framework that jointly models text, audio, emotion, and motion, and is trained with preference-based objectives to encourage both appropriate and diverse listener responses. Extensive experiments show that ReactMotion outperforms retrieval baselines and cascaded LLM-based pipelines, generating more natural, diverse, and appropriate listener motions.
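The abstract does not specify the exact form of the preference-based objective, but a common way to exploit candidates annotated with graded appropriateness is a pairwise ranking loss: for every pair of candidates where one is rated more appropriate than the other, push the model's score for the preferred one above the other's. The sketch below is a minimal, hypothetical illustration of that idea (function name, inputs, and Bradley-Terry-style form are assumptions, not the paper's actual objective):

```python
import math

def pairwise_preference_loss(scores, appropriateness):
    """Bradley-Terry-style pairwise ranking loss (illustrative sketch).

    `scores`: model scores for the candidate listener motions of one
    speaker utterance.
    `appropriateness`: their annotated appropriateness grades (higher
    means a more suitable reaction).
    For every ordered pair where candidate i is rated strictly higher
    than candidate j, the loss -log sigmoid(s_i - s_j) is small when
    the preferred candidate scores higher, large otherwise.
    """
    loss, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if appropriateness[i] > appropriateness[j]:
                # -log sigmoid(s_i - s_j) = log(1 + exp(s_j - s_i))
                loss += math.log(1.0 + math.exp(scores[j] - scores[i]))
                pairs += 1
    return loss / max(pairs, 1)
```

Because every utterance has several graded candidates in ReactMotionNet, such an objective can supervise relative appropriateness without committing to a single ground-truth motion, which matches the one-to-many framing described above.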