
Baichuan-Omni Technical Report

October 11, 2024
Authors: Yadong Li, Haoze Sun, Mingan Lin, Tianpeng Li, Guosheng Dong, Tao Zhang, Bowen Ding, Wei Song, Zhenglin Cheng, Yuqi Huo, Song Chen, Xu Li, Da Pan, Shusen Zhang, Xin Wu, Zheng Liang, Jun Liu, Tao Zhang, Keer Lu, Yaqi Zhao, Yanjun Shen, Fan Yang, Kaicheng Yu, Tao Lin, Jianhua Xu, Zenan Zhou, Weipeng Chen
cs.AI

Abstract

The salient multimodal capabilities and interactive experience of GPT-4o highlight its critical role in practical applications, yet it lacks a high-performing open-source counterpart. In this paper, we introduce Baichuan-Omni, the first open-source 7B Multimodal Large Language Model (MLLM) adept at concurrently processing and analyzing the image, video, audio, and text modalities, while delivering an advanced multimodal interactive experience and strong performance. We propose an effective multimodal training schema that starts from a 7B model and proceeds through two stages: multimodal alignment, followed by multitask fine-tuning across the audio, image, video, and text modalities. This approach equips the language model with the ability to handle visual and audio data effectively. Demonstrating strong performance across various omni-modal and multimodal benchmarks, we aim for this contribution to serve as a competitive baseline for the open-source community in advancing multimodal understanding and real-time interaction.
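
The two-stage schema summarized above can be pictured roughly as follows. This is a minimal, hypothetical sketch assuming a HuggingFace-style model whose forward pass returns a language-modeling loss; the submodule names (vision_projector, audio_projector, llm), helper functions, and hyperparameters are illustrative assumptions, not the released Baichuan-Omni code.

```python
# Hypothetical sketch of a two-stage omni-modal training schema:
# stage 1 (multimodal alignment) trains only the modality projectors against a
# frozen 7B LLM; stage 2 (multitask fine-tuning) unfreezes the LLM and mixes
# tasks across audio, image, video, and text.
import torch


def set_trainable(model: torch.nn.Module, trainable: dict) -> None:
    """Freeze or unfreeze top-level submodules by name."""
    for name, module in model.named_children():
        flag = trainable.get(name, False)
        for p in module.parameters():
            p.requires_grad = flag


def run_stage(model, loader, trainable, lr, steps):
    """One training stage: language-modeling loss over mixed-modality batches."""
    set_trainable(model, trainable)
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(params, lr=lr)
    for _, batch in zip(range(steps), loader):
        loss = model(**batch).loss  # assumes a forward pass that returns .loss
        loss.backward()
        opt.step()
        opt.zero_grad()


# Stage 1: multimodal alignment -- projectors only, LLM frozen.
# run_stage(omni_model, align_loader,
#           {"vision_projector": True, "audio_projector": True, "llm": False},
#           lr=1e-3, steps=10_000)

# Stage 2: multitask fine-tuning across audio, image, video, and text -- all trainable.
# run_stage(omni_model, multitask_loader,
#           {"vision_projector": True, "audio_projector": True, "llm": True},
#           lr=2e-5, steps=50_000)
```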
