Baichuan-Omni Technical Report

Abstract

The salient multimodal capabilities and interactive experience of GPT-4o highlight its critical role in practical applications, yet it lacks a high-performing open-source counterpart. In this paper, we introduce Baichuan-Omni, the first open-source 7B Multimodal Large Language Model (MLLM) adept at concurrently processing and analyzing image, video, audio, and text modalities, while delivering an advanced multimodal interactive experience and strong performance. We propose an effective multimodal training schema that starts from a 7B model and proceeds through two stages, multimodal alignment and multitask fine-tuning, across the audio, image, video, and text modalities. This approach equips the language model with the ability to handle visual and audio data effectively. The model demonstrates strong performance across various omni-modal and multimodal benchmarks, and we aim for this contribution to serve as a competitive baseline for the open-source community in advancing multimodal understanding and real-time interaction.
