Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities
October 15, 2024
Authors: Zhifei Xie, Changqiao Wu
cs.AI
Abstract
GPT-4o, an all-encompassing model, represents a milestone in the development
of large multi-modal language models. It can understand visual, auditory, and
textual modalities, directly output audio, and support flexible duplex
interaction. Models from the open-source community often achieve some
functionalities of GPT-4o, such as visual understanding and voice chat.
Nevertheless, training a unified model that incorporates all modalities is
challenging due to the complexities of multi-modal data, intricate model
architectures, and training processes. In this paper, we introduce Mini-Omni2,
a visual-audio assistant capable of providing real-time, end-to-end voice
responses to vision and audio queries. By integrating pretrained visual and
auditory encoders, Mini-Omni2 maintains performance in individual modalities.
We propose a three-stage training process to align modalities, allowing the
language model to handle multi-modal inputs and outputs after training on a
limited dataset. For interaction, we introduce a command-based interruption
mechanism, enabling more flexible interaction with users. To the best of our
knowledge, Mini-Omni2 is one of the closest reproductions of GPT-4o, offering a
similar form of functionality, and we hope it can offer valuable insights for
subsequent research.
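
The abstract describes integrating pretrained visual and auditory encoders with a language model so that image, audio, and text inputs share one embedding sequence. Below is a minimal PyTorch sketch of that idea; the adapter design, feature dimensions, and module names are illustrative assumptions, not Mini-Omni2's actual architecture.

import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Projects features from a frozen pretrained encoder into the LM hidden space."""
    def __init__(self, enc_dim: int, lm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(enc_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)

class MultimodalFrontend(nn.Module):
    """Concatenates adapted vision/audio features with text embeddings for the LM."""
    def __init__(self, vis_dim=768, aud_dim=512, lm_dim=1024, vocab=32000):
        super().__init__()
        self.vision_adapter = ModalityAdapter(vis_dim, lm_dim)
        self.audio_adapter = ModalityAdapter(aud_dim, lm_dim)
        self.text_embed = nn.Embedding(vocab, lm_dim)

    def forward(self, vis_feats, aud_feats, text_ids):
        parts = [
            self.vision_adapter(vis_feats),   # [B, N_img, lm_dim]
            self.audio_adapter(aud_feats),    # [B, N_aud, lm_dim]
            self.text_embed(text_ids),        # [B, N_txt, lm_dim]
        ]
        return torch.cat(parts, dim=1)        # fed to the language-model backbone

# Dummy features standing in for CLIP-style patch features and Whisper-style audio frames.
frontend = MultimodalFrontend()
vis = torch.randn(1, 50, 768)
aud = torch.randn(1, 300, 512)
txt = torch.randint(0, 32000, (1, 16))
print(frontend(vis, aud, txt).shape)          # torch.Size([1, 366, 1024])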
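The command-based interruption mechanism implies that the system keeps listening while it speaks and halts generation when it recognizes an interrupt command. The sketch below only illustrates that control flow with threads and a shared stop flag; detect_interrupt_command and the text chunks standing in for audio are hypothetical simplifications of the paper's real-time pipeline.

import queue
import threading
import time

stop_event = threading.Event()

def detect_interrupt_command(chunk: str) -> bool:
    # Placeholder: the real system decodes incoming audio in parallel and
    # flags a recognized interrupt command.
    return chunk == "stop"

def monitor_user_input(user_stream: "queue.Queue[str]") -> None:
    while not stop_event.is_set():
        chunk = user_stream.get()
        if detect_interrupt_command(chunk):
            stop_event.set()  # signal the speaking loop to halt

def stream_assistant_speech(response_chunks) -> None:
    for chunk in response_chunks:
        if stop_event.is_set():
            print("[interrupted: yielding the floor to the user]")
            return
        print(f"assistant says: {chunk}")  # stand-in for streaming audio playback
        time.sleep(0.1)

user_stream: "queue.Queue[str]" = queue.Queue()
threading.Thread(target=monitor_user_input, args=(user_stream,), daemon=True).start()

speaker = threading.Thread(
    target=stream_assistant_speech,
    args=(["Sure,", "the image", "shows", "a cat", "on a mat."],),
)
speaker.start()
time.sleep(0.25)
user_stream.put("stop")  # simulated interrupt command from the user
speaker.join()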