OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions
May 27, 2025
Authors: Cheng Luo, Jianghui Wang, Bing Li, Siyang Song, Bernard Ghanem
cs.AI
Abstract
In this paper, we introduce Online Multimodal Conversational Response Generation (OMCRG), a novel task that aims to generate synchronized verbal and non-verbal listener feedback online, conditioned on the speaker's multimodal input. OMCRG reflects natural dyadic interactions and poses new challenges in achieving synchronization between the listener's generated audio and facial responses. To address these challenges, we introduce text as an intermediate modality that bridges the audio and facial responses. We then propose OmniResponse, a Multimodal Large Language Model (MLLM) that autoregressively generates high-quality multimodal listener responses. OmniResponse leverages a pretrained LLM enhanced with two novel components: Chrono-Text, which temporally anchors generated text tokens, and TempoVoice, a controllable online TTS module that produces speech synchronized with facial reactions. To support further OMCRG research, we present ResponseNet, a new dataset comprising 696 high-quality dyadic interactions featuring synchronized split-screen videos, multichannel audio, transcripts, and facial behavior annotations. Comprehensive evaluations conducted on ResponseNet demonstrate that OmniResponse significantly outperforms baseline models in terms of semantic speech content, audio-visual synchronization, and generation quality.
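
To make the described pipeline concrete, the following is a minimal Python sketch of an autoregressive generation loop in which text serves as the intermediate modality bridging audio and facial responses. All interfaces shown here (next_timed_token, synthesize, decode) and the timing representation are illustrative assumptions for exposition, not the paper's actual implementation of Chrono-Text or TempoVoice.

```python
from dataclasses import dataclass


@dataclass
class TimedToken:
    text: str       # generated word or sub-word of the listener's reply
    start_s: float  # temporal anchor (assumed Chrono-Text-style representation)
    end_s: float


def generate_listener_response(speaker_audio, speaker_video, llm, tts, face_decoder,
                               frame_rate: float = 25.0, horizon_s: float = 2.0):
    """Toy online loop: the LLM emits time-anchored text, a TTS module renders it
    to audio, and facial frames are decoded on the same clock. Module interfaces
    are hypothetical."""
    t = 0.0
    timed_tokens, audio_frames, face_frames = [], [], []
    while t < horizon_s:
        # 1) The LLM consumes the speaker's multimodal input plus the listener's
        #    own history and emits the next text token with a duration estimate.
        token = llm.next_timed_token(speaker_audio, speaker_video, timed_tokens, t)
        timed_tokens.append(TimedToken(token.text, t, t + token.duration_s))

        # 2) An online TTS module (TempoVoice plays this role in the paper)
        #    synthesizes audio for exactly that time span.
        audio_frames += tts.synthesize(token.text, start_s=t, end_s=t + token.duration_s)

        # 3) Facial-reaction frames are decoded against the shared text timeline,
        #    which is what keeps audio and face synchronized.
        n_frames = int(token.duration_s * frame_rate)
        face_frames += face_decoder.decode(timed_tokens, n_frames)

        t += token.duration_s
    return timed_tokens, audio_frames, face_frames
```

The key design point this sketch tries to convey is that both the audio and the facial streams are driven from the same time-anchored text tokens, so synchronization falls out of the shared timeline rather than being enforced post hoc.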