Towards Human-like Multimodal Conversational Agent by Generating Engaging Speech
September 18, 2025
Authors: Taesoo Kim, Yongsik Jo, Hyunmin Song, Taehwan Kim
cs.AI
Abstract
Human conversation involves language, speech, and visual cues, with each medium providing complementary information. For instance, speech conveys a vibe or tone not fully captured by text alone. While multimodal LLMs focus on generating text responses from diverse inputs, less attention has been paid to generating natural and engaging speech. We propose a human-like agent that generates speech responses based on conversation mood and response style information. To achieve this, we build a novel MultiSensory Conversation dataset focused on speech, enabling agents to generate natural speech. We then propose a multimodal LLM-based model that generates text responses and voice descriptions, which in turn are used to generate speech conveying paralinguistic information. Experimental results demonstrate the effectiveness of utilizing both visual and audio modalities in conversation to generate engaging speech. The source code is available at https://github.com/kimtaesu24/MSenC.
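
The abstract describes a two-stage pipeline: a multimodal LLM first produces a text response together with a natural-language voice description, and the description then conditions a speech synthesizer so the output carries the intended paralinguistic style. The sketch below illustrates only that interface; it does not reproduce the authors' model (see the linked repository for that). Stage 1 is mocked with hard-coded outputs, and the off-the-shelf Parler-TTS model is used as an assumed stand-in for a description-conditioned synthesizer.

```python
# Minimal sketch of the two-stage idea from the abstract:
#   (1) a multimodal LLM yields a text response plus a voice description;
#   (2) a description-conditioned TTS renders speech in the requested style.
# Stage 1 is mocked here; Parler-TTS is a stand-in, not the paper's model.
import soundfile as sf
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

# Stage 1 (mocked): in the paper this would come from a multimodal LLM
# conditioned on the dialogue history plus visual and audio context.
text_response = "That sounds wonderful, congratulations!"
voice_description = (
    "A cheerful female speaker with a warm, slightly excited tone, "
    "speaking at a moderate pace in a quiet room."
)

# Stage 2: condition speech generation on the voice description so the
# audio conveys mood and style, not just the words themselves.
tts = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1")
tok = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

desc_ids = tok(voice_description, return_tensors="pt").input_ids
prompt_ids = tok(text_response, return_tensors="pt").input_ids

audio = tts.generate(input_ids=desc_ids, prompt_input_ids=prompt_ids)
sf.write("reply.wav", audio.cpu().numpy().squeeze(), tts.config.sampling_rate)
```

The key design point this illustrates is that the voice description acts as the bridge between the language model and the synthesizer: style is expressed as free-form text rather than fixed emotion labels, so the same TTS interface can realize any mood the LLM infers from the conversation.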