EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
September 26, 2024
Authors: Kai Chen, Yunhao Gou, Runhui Huang, Zhili Liu, Daxin Tan, Jing Xu, Chunwei Wang, Yi Zhu, Yihan Zeng, Kuo Yang, Dingdong Wang, Kun Xiang, Haoyuan Li, Haoli Bai, Jianhua Han, Xiaohui Li, Weike Jin, Nian Xie, Yu Zhang, James T. Kwok, Hengshuang Zhao, Xiaodan Liang, Dit-Yan Yeung, Xiao Chen, Zhenguo Li, Wei Zhang, Qun Liu, Lanqing Hong, Lu Hou, Hang Xu
cs.AI
Abstract
GPT-4o, an omni-modal model that enables vocal conversations with diverse
emotions and tones, marks a milestone for omni-modal foundation models.
However, empowering Large Language Models to perceive and generate images,
text, and speech end-to-end with publicly available data remains challenging
in the open-source community. Existing vision-language models rely on external
tools for speech processing, while speech-language models still suffer from
limited or even absent vision-understanding abilities. To address this gap, we
propose EMOVA (EMotionally Omni-present Voice Assistant), which equips Large
Language Models with end-to-end speech capabilities while maintaining leading
vision-language performance. With a semantic-acoustic disentangled speech
tokenizer, we surprisingly find that omni-modal alignment can further enhance
vision-language and speech abilities compared with the corresponding
bi-modally aligned counterparts. Moreover, we propose a lightweight style
module for flexible control of speech styles (e.g., emotions and pitches). For
the first time, EMOVA achieves state-of-the-art performance on both
vision-language and speech benchmarks while supporting omni-modal spoken
dialogue with vivid emotions.
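
The abstract names a semantic-acoustic disentangled speech tokenizer without
detailing its internals. As a rough, hypothetical sketch of what such
disentanglement means, the PyTorch snippet below separates per-frame semantic
content (discretized against a codebook, so an LLM can model it like text
tokens) from a time-pooled style vector carrying utterance-level acoustics.
All names and sizes here (DisentangledTokenizer, codebook_size, style_dim) are
illustrative assumptions, not EMOVA's actual implementation.

```python
# Hypothetical sketch of semantic-acoustic disentanglement, NOT EMOVA's code.
# Assumes frame-level speech features of shape (B, T, D), e.g. from a
# self-supervised speech encoder.
import torch
import torch.nn as nn

class DisentangledTokenizer(nn.Module):
    def __init__(self, feat_dim: int = 768, codebook_size: int = 1024,
                 style_dim: int = 128):
        super().__init__()
        # Codebook for discrete semantic tokens the LLM can model like text.
        self.codebook = nn.Embedding(codebook_size, feat_dim)
        # Global style head: pooling over time keeps utterance-level acoustics
        # (emotion, pitch) separate from per-frame content.
        self.style_head = nn.Linear(feat_dim, style_dim)

    def forward(self, feats: torch.Tensor):
        batch = feats.size(0)
        # L2 distance from each frame to every codebook entry -> (B, T, K).
        codes = self.codebook.weight.unsqueeze(0).expand(batch, -1, -1)
        dists = torch.cdist(feats, codes)
        semantic_ids = dists.argmin(dim=-1)           # (B, T) discrete tokens
        style = self.style_head(feats.mean(dim=1))    # (B, style_dim) global style
        return semantic_ids, style

tok = DisentangledTokenizer()
ids, style = tok(torch.randn(2, 50, 768))
print(ids.shape, style.shape)  # torch.Size([2, 50]) torch.Size([2, 128])
```

Similarly, the lightweight style module is described only as enabling flexible
control of emotions and pitches. Below is a minimal sketch of one plausible
design: discrete style labels are embedded and projected into a single
conditioning vector that a speech decoder could consume alongside semantic
tokens. Again, every name and parameter is an assumption for illustration, not
the paper's API.

```python
# Hypothetical sketch of a lightweight style module, NOT EMOVA's code.
import torch
import torch.nn as nn

class StyleModule(nn.Module):
    def __init__(self, num_emotions: int, num_pitches: int, hidden_size: int):
        super().__init__()
        # Separate embedding tables for each controllable style factor.
        self.emotion_embed = nn.Embedding(num_emotions, hidden_size)
        self.pitch_embed = nn.Embedding(num_pitches, hidden_size)
        # A single small projection keeps the module lightweight.
        self.proj = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, emotion_id: torch.Tensor,
                pitch_id: torch.Tensor) -> torch.Tensor:
        style = torch.cat(
            [self.emotion_embed(emotion_id), self.pitch_embed(pitch_id)],
            dim=-1,
        )
        return self.proj(style)  # (batch, hidden_size) conditioning vector

# Usage: condition speech generation on emotion id 2 and pitch id 1
# (label-to-id mappings are assumptions).
style = StyleModule(num_emotions=8, num_pitches=3, hidden_size=256)
cond = style(torch.tensor([2]), torch.tensor([1]))
print(cond.shape)  # torch.Size([1, 256])
```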