Baichuan-Omni Technical Report

Abstract

The salient multimodal capabilities and interactive experience of GPT-4o highlight its critical role in practical applications, yet it lacks a high-performing open-source counterpart. In this paper, we introduce Baichuan-Omni, the first open-source 7B Multimodal Large Language Model (MLLM) that can concurrently process and analyze the image, video, audio, and text modalities while delivering an advanced multimodal interactive experience and strong performance. We propose an effective multimodal training schema that starts from a 7B model and proceeds through two stages: multimodal alignment, followed by multitask fine-tuning across the audio, image, video, and text modalities. This approach equips the language model to handle visual and audio data effectively. The model demonstrates strong performance across various omni-modal and multimodal benchmarks, and we aim for this contribution to serve as a competitive baseline for the open-source community in advancing multimodal understanding and real-time interaction.