ReactMotion: Generating Reactive Listener Motions from Speaker Utterance

Abstract

In this paper, we introduce a new task, Reactive Listener Motion Generation from Speaker Utterance, which aims to generate naturalistic listener body motions that respond appropriately to a speaker's utterance. Modeling such nonverbal listener behaviors remains underexplored and challenging because human reactions are inherently non-deterministic. To facilitate this task, we present ReactMotionNet, a large-scale dataset that pairs speaker utterances with multiple candidate listener motions annotated with varying degrees of appropriateness. This dataset design explicitly captures the one-to-many nature of listener behavior and provides supervision beyond a single ground-truth motion. Building on this dataset design, we develop preference-oriented evaluation protocols tailored to reactive appropriateness, which conventional motion metrics focused on input-motion alignment overlook. We further propose ReactMotion, a unified generative framework that jointly models text, audio, emotion, and motion, and is trained with preference-based objectives to encourage both appropriate and diverse listener responses. Extensive experiments show that ReactMotion outperforms retrieval baselines and cascaded LLM-based pipelines, generating more natural, diverse, and appropriate listener motions.
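To make the one-to-many supervision and the preference-oriented evaluation described above more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each utterance comes with several candidate motions carrying an ordinal appropriateness label, and that a model exposes a scoring function over (utterance, candidate) pairs. The names `Candidate`, `Sample`, `preference_pairs`, and `pairwise_accuracy`, as well as the integer label scale, are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    motion_id: str        # hypothetical identifier of a listener motion clip
    appropriateness: int  # hypothetical ordinal label; higher means more appropriate


@dataclass
class Sample:
    utterance: str               # speaker utterance (text only in this sketch)
    candidates: List[Candidate]  # one-to-many: several listener motions per utterance


def preference_pairs(sample: Sample) -> List[Tuple[Candidate, Candidate]]:
    """Return (better, worse) pairs for candidates whose labels differ."""
    pairs = []
    for a, b in combinations(sample.candidates, 2):
        if a.appropriateness > b.appropriateness:
            pairs.append((a, b))
        elif b.appropriateness > a.appropriateness:
            pairs.append((b, a))
        # Candidates with equal labels express no preference and are skipped.
    return pairs


def pairwise_accuracy(sample: Sample,
                      score: Callable[[str, Candidate], float]) -> float:
    """Fraction of labeled preference pairs that the scorer orders correctly."""
    pairs = preference_pairs(sample)
    if not pairs:
        return 0.0
    wins = sum(
        1 for better, worse in pairs
        if score(sample.utterance, better) > score(sample.utterance, worse)
    )
    return wins / len(pairs)
```

Under these assumptions, a scorer that agrees with every labeled preference reaches an accuracy of 1.0, while a metric that only measures input-motion alignment could score well below that on the same candidates.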
