Towards Human-like Multimodal Conversational Agent by Generating Engaging Speech
September 18, 2025
Authors: Taesoo Kim, Yongsik Jo, Hyunmin Song, Taehwan Kim
cs.AI
Abstract
Human conversation involves language, speech, and visual cues, with each medium providing complementary information. For instance, speech conveys a vibe or tone not fully captured by text alone. While multimodal LLMs focus on generating text responses from diverse inputs, less attention has been paid to generating natural and engaging speech. We propose a human-like agent that generates speech responses based on conversation mood and response style information. To achieve this, we build a novel MultiSensory Conversation dataset focused on speech, enabling agents to generate natural speech. We then propose a multimodal LLM-based model that generates text responses and voice descriptions, which in turn are used to generate speech conveying paralinguistic information. Experimental results demonstrate the effectiveness of utilizing both visual and audio modalities in conversation to generate engaging speech. The source code is available at https://github.com/kimtaesu24/MSenC.
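
The abstract describes a two-stage pipeline: a multimodal LLM first produces a text response together with a natural-language voice description, and the description then conditions a speech synthesizer so the output carries the intended paralinguistic style. The sketch below illustrates only that interface; it does not reproduce the authors' model (see the linked repository for that). Stage 1 is mocked with hard-coded outputs, and the off-the-shelf Parler-TTS model is used as an assumed stand-in for a description-conditioned synthesizer.

```python
# Minimal sketch of the two-stage idea from the abstract:
#   (1) a multimodal LLM yields a text response plus a voice description;
#   (2) a description-conditioned TTS renders speech in the requested style.
# Stage 1 is mocked here; Parler-TTS is a stand-in, not the paper's model.
import soundfile as sf
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

# Stage 1 (mocked): in the paper this would come from a multimodal LLM
# conditioned on the dialogue history plus visual and audio context.
text_response = "That sounds wonderful, congratulations!"
voice_description = (
    "A cheerful female speaker with a warm, slightly excited tone, "
    "speaking at a moderate pace in a quiet room."
)

# Stage 2: condition speech generation on the voice description so the
# audio conveys mood and style, not just the words themselves.
tts = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1")
tok = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

desc_ids = tok(voice_description, return_tensors="pt").input_ids
prompt_ids = tok(text_response, return_tensors="pt").input_ids

audio = tts.generate(input_ids=desc_ids, prompt_input_ids=prompt_ids)
sf.write("reply.wav", audio.cpu().numpy().squeeze(), tts.config.sampling_rate)
```

The key design point this illustrates is that the voice description acts as the bridge between the language model and the synthesizer: style is expressed as free-form text rather than fixed emotion labels, so the same TTS interface can realize any mood the LLM infers from the conversation.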