OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions

Abstract

In this paper, we introduce Online Multimodal Conversational Response Generation (OMCRG), a novel task that aims to generate synchronized verbal and non-verbal listener feedback online, conditioned on the speaker's multimodal input. OMCRG reflects natural dyadic interactions and poses new challenges in synchronizing the listener's generated audio and facial responses. To address these challenges, we introduce text as an intermediate modality that bridges the audio and facial responses. Building on this, we propose OmniResponse, a Multimodal Large Language Model (MLLM) that autoregressively generates high-quality multimodal listener responses. OmniResponse builds on a pretrained LLM enhanced with two novel components: Chrono-Text, which temporally anchors generated text tokens, and TempoVoice, a controllable online TTS module that produces speech synchronized with facial reactions. To support further OMCRG research, we present ResponseNet, a new dataset comprising 696 high-quality dyadic interactions with synchronized split-screen videos, multichannel audio, transcripts, and facial behavior annotations. Comprehensive evaluations on ResponseNet demonstrate that OmniResponse significantly outperforms baseline models in semantic speech content, audio-visual synchronization, and generation quality.
