Towards Human-like Multimodal Conversational Agent by Generating Engaging Speech

September 18, 2025
Authors: Taesoo Kim, Yongsik Jo, Hyunmin Song, Taehwan Kim
cs.AI

Abstract

Human conversation involves language, speech, and visual cues, with each medium providing complementary information. For instance, speech conveys a vibe or tone not fully captured by text alone. While multimodal LLMs focus on generating text responses from diverse inputs, less attention has been paid to generating natural and engaging speech. We propose a human-like agent that generates speech responses based on conversation mood and responsive style information. To achieve this, we build a novel MultiSensory Conversation dataset focused on speech to enable agents to generate natural speech. We then propose a multimodal LLM-based model for generating text responses and voice descriptions, which are used to generate speech covering paralinguistic information. Experimental results demonstrate the effectiveness of utilizing both visual and audio modalities in conversation to generate engaging speech. The source code is available at https://github.com/kimtaesu24/MSenC.
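
The abstract describes a two-stage flow: a multimodal LLM that consumes the conversation's language, speech, and visual cues and emits both a text reply and a natural-language voice description, followed by a description-conditioned speech synthesizer that renders the reply with the described paralinguistic style. The sketch below is only a minimal illustration of that interface, not the authors' implementation; every class and function name in it is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Turn:
    """One conversation turn: text plus optional audio/visual context."""
    text: str
    audio_path: Optional[str] = None
    video_path: Optional[str] = None


def generate_reply_and_voice_description(history: List[Turn]) -> Tuple[str, str]:
    """Stand-in for the multimodal LLM stage: returns a text reply and a
    paralinguistic voice description (mood, tone, speaking style)."""
    # A real model would condition on the text, speech, and visual cues in
    # `history`; fixed placeholders are returned here to show the interface.
    reply = "That's wonderful news, congratulations!"
    voice_description = "A warm, upbeat voice, slightly fast, with a cheerful tone."
    return reply, voice_description


def synthesize_speech(text: str, voice_description: str) -> bytes:
    """Stand-in for a description-conditioned TTS stage that renders `text`
    in the style specified by `voice_description`."""
    return b""  # placeholder: a real system would return waveform bytes


if __name__ == "__main__":
    history = [Turn(text="I just got accepted to my dream school!",
                    video_path="user_clip.mp4")]
    reply, description = generate_reply_and_voice_description(history)
    audio = synthesize_speech(reply, description)
    print(reply)
    print(description)
```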