EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
September 26, 2024
Authors: Kai Chen, Yunhao Gou, Runhui Huang, Zhili Liu, Daxin Tan, Jing Xu, Chunwei Wang, Yi Zhu, Yihan Zeng, Kuo Yang, Dingdong Wang, Kun Xiang, Haoyuan Li, Haoli Bai, Jianhua Han, Xiaohui Li, Weike Jin, Nian Xie, Yu Zhang, James T. Kwok, Hengshuang Zhao, Xiaodan Liang, Dit-Yan Yeung, Xiao Chen, Zhenguo Li, Wei Zhang, Qun Liu, Lanqing Hong, Lu Hou, Hang Xu
cs.AI
Abstract
GPT-4o, an omni-modal model that enables vocal conversations with diverse
emotions and tones, marks a milestone for omni-modal foundation models.
However, empowering Large Language Models to perceive and generate images,
text, and speech end-to-end with publicly available data remains challenging
in the open-source community. Existing vision-language models rely on external
tools for speech processing, while speech-language models still suffer from
limited or even absent vision-understanding abilities. To address this gap, we
propose EMOVA (EMotionally Omni-present Voice Assistant), which equips Large
Language Models with end-to-end speech capabilities while maintaining leading
vision-language performance. With a semantic-acoustic disentangled speech
tokenizer, we surprisingly find that omni-modal alignment can further enhance
vision-language and speech abilities compared with the corresponding
bi-modally aligned counterparts. Moreover, we propose a lightweight style
module for flexible control of speech styles (e.g., emotions and pitches). For
the first time, EMOVA achieves state-of-the-art performance on both
vision-language and speech benchmarks while supporting omni-modal spoken
dialogue with vivid emotions.
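
The abstract names a semantic-acoustic disentangled speech tokenizer without
detailing its internals. As a rough, hypothetical sketch of what such
disentanglement means, the PyTorch snippet below separates per-frame semantic
content (discretized against a codebook, so an LLM can model it like text
tokens) from a time-pooled style vector carrying utterance-level acoustics.
All names and sizes here (DisentangledTokenizer, codebook_size, style_dim) are
illustrative assumptions, not EMOVA's actual implementation.

```python
# Hypothetical sketch of semantic-acoustic disentanglement, NOT EMOVA's code.
# Assumes frame-level speech features of shape (B, T, D), e.g. from a
# self-supervised speech encoder.
import torch
import torch.nn as nn

class DisentangledTokenizer(nn.Module):
    def __init__(self, feat_dim: int = 768, codebook_size: int = 1024,
                 style_dim: int = 128):
        super().__init__()
        # Codebook for discrete semantic tokens the LLM can model like text.
        self.codebook = nn.Embedding(codebook_size, feat_dim)
        # Global style head: pooling over time keeps utterance-level acoustics
        # (emotion, pitch) separate from per-frame content.
        self.style_head = nn.Linear(feat_dim, style_dim)

    def forward(self, feats: torch.Tensor):
        batch = feats.size(0)
        # L2 distance from each frame to every codebook entry -> (B, T, K).
        codes = self.codebook.weight.unsqueeze(0).expand(batch, -1, -1)
        dists = torch.cdist(feats, codes)
        semantic_ids = dists.argmin(dim=-1)           # (B, T) discrete tokens
        style = self.style_head(feats.mean(dim=1))    # (B, style_dim) global style
        return semantic_ids, style

tok = DisentangledTokenizer()
ids, style = tok(torch.randn(2, 50, 768))
print(ids.shape, style.shape)  # torch.Size([2, 50]) torch.Size([2, 128])
```

Similarly, the lightweight style module is described only as enabling flexible
control of emotions and pitches. Below is a minimal sketch of one plausible
design: discrete style labels are embedded and projected into a single
conditioning vector that a speech decoder could consume alongside semantic
tokens. Again, every name and parameter is an assumption for illustration, not
the paper's API.

```python
# Hypothetical sketch of a lightweight style module, NOT EMOVA's code.
import torch
import torch.nn as nn

class StyleModule(nn.Module):
    def __init__(self, num_emotions: int, num_pitches: int, hidden_size: int):
        super().__init__()
        # Separate embedding tables for each controllable style factor.
        self.emotion_embed = nn.Embedding(num_emotions, hidden_size)
        self.pitch_embed = nn.Embedding(num_pitches, hidden_size)
        # A single small projection keeps the module lightweight.
        self.proj = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, emotion_id: torch.Tensor,
                pitch_id: torch.Tensor) -> torch.Tensor:
        style = torch.cat(
            [self.emotion_embed(emotion_id), self.pitch_embed(pitch_id)],
            dim=-1,
        )
        return self.proj(style)  # (batch, hidden_size) conditioning vector

# Usage: condition speech generation on emotion id 2 and pitch id 1
# (label-to-id mappings are assumptions).
style = StyleModule(num_emotions=8, num_pitches=3, hidden_size=256)
cond = style(torch.tensor([2]), torch.tensor([1]))
print(cond.shape)  # torch.Size([1, 256])
```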