Towards Human-like Multimodal Conversational Agent by Generating Engaging Speech

Abstract

Human conversation involves language, speech, and visual cues, with each medium providing complementary information. For instance, speech conveys a vibe or tone that text alone does not fully capture. While multimodal LLMs focus on generating text responses from diverse inputs, less attention has been paid to generating natural and engaging speech. We propose a human-like agent that generates speech responses based on conversation mood and responsive style information. To achieve this, we build a novel MultiSensory Conversation dataset focused on speech, which enables agents to generate natural speech. We then propose a multimodal LLM-based model that generates text responses and voice descriptions, which are in turn used to generate speech that carries paralinguistic information. Experimental results demonstrate the effectiveness of utilizing both visual and audio modalities in conversation to generate engaging speech. The source code is available at https://github.com/kimtaesu24/MSenC
