OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions
May 27, 2025
Authors: Cheng Luo, Jianghui Wang, Bing Li, Siyang Song, Bernard Ghanem
cs.AI
Abstract
In this paper, we introduce Online Multimodal Conversational Response
Generation (OMCRG), a novel task that aims to generate synchronized
verbal and non-verbal listener feedback online, conditioned on the speaker's
multimodal input. OMCRG reflects natural dyadic interactions and poses new
challenges in achieving synchronization between the generated audio and facial
responses of the listener. To address these challenges, we
introduce text as an intermediate modality to bridge the audio and facial
responses. We hence propose OmniResponse, a Multimodal Large Language Model
(MLLM) that autoregressively generates high-quality multimodal listener
responses. OmniResponse leverages a pretrained LLM enhanced with two novel
components: Chrono-Text, which temporally anchors generated text tokens, and
TempoVoice, a controllable online TTS module that produces speech synchronized
with facial reactions. To support further OMCRG research, we present
ResponseNet, a new dataset comprising 696 high-quality dyadic interactions
featuring synchronized split-screen videos, multichannel audio, transcripts,
and facial behavior annotations. Comprehensive evaluations conducted on
ResponseNet demonstrate that OmniResponse significantly outperforms baseline
models in terms of semantic speech content, audio-visual synchronization, and
generation quality.