Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment
February 6, 2025
Authors: Zuyan Liu, Yuhao Dong, Jiahui Wang, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao
cs.AI
Abstract
Recent advances in large language models, particularly following GPT-4o, have
sparked increasing interest in developing omni-modal models capable of
understanding more modalities. While some open-source alternatives have
emerged, there is still a notable lag behind specialized single-modality models
in performance. In this paper, we present Ola, an Omni-modal language model
that achieves competitive performance across image, video, and audio
understanding compared to specialized counterparts. The core design of Ola lies
in its progressive modality alignment strategy, which extends the modalities
supported by the language model step by step. Our training pipeline begins with
the most distinct modalities: image and text, then gradually expands the skill
sets of the model using speech data that connects language and audio knowledge,
and video data that connects all modalities. The progressive learning pipeline
also enables us to keep the cross-modal alignment data relatively small, making
it easier and less costly to develop omni-modal models from existing
vision-language models. Moreover, to unlock an advanced interactive
experience like GPT-4o, we further design a sentence-wise decoding solution for
streaming speech generation. Extensive experiments demonstrate that Ola
surpasses existing open omni-modal LLMs across all modalities while achieving
highly competitive performance compared to state-of-the-art specialized models
of similar sizes. We aim to make Ola a fully open omni-modal understanding
solution to advance future research in this emerging field. Model weights,
code, and data are open-sourced at https://github.com/Ola-Omni/Ola.
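The progressive modality alignment described in the abstract can be pictured as three training stages that reuse the same backbone weights: image-text first, then speech data bridging language and audio, then video connecting all modalities. The sketch below is an illustration only, not the authors' implementation; the stage and dataset names and the `train_on` callback are hypothetical placeholders, and the actual recipe is in the linked repository.

```python
# Hypothetical sketch of a progressive modality alignment schedule.
# Stage order follows the abstract; dataset keys and `train_on` are assumptions.
STAGES = [
    # Stage 1: start from the most distinct modalities, image and text.
    ("image-text", ["image_text_pairs"]),
    # Stage 2: add speech data that bridges language and audio knowledge.
    ("speech", ["speech_alignment_data"]),
    # Stage 3: add video (frames plus audio track) that connects all modalities.
    ("video", ["video_with_audio_data"]),
]

def progressive_alignment(model, data_registry, train_on):
    """Run the alignment stages in order, reusing the model weights between stages.

    `train_on(model, datasets)` is assumed to fine-tune `model` in place on the
    given datasets and return it; each stage only adds a modest amount of new
    cross-modal alignment data on top of the previous stage's checkpoint.
    """
    for stage_name, dataset_keys in STAGES:
        datasets = [data_registry[key] for key in dataset_keys]
        print(f"[stage] {stage_name}: {len(datasets)} dataset(s)")
        model = train_on(model, datasets)
    return model
```

Keeping the per-stage alignment data small is what makes it practical to grow an existing vision-language model into an omni-modal one rather than training from scratch.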
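The sentence-wise decoding for streaming speech generation can likewise be sketched as: decode text incrementally and hand each completed sentence to a speech synthesizer right away, instead of waiting for the full reply. The `generate_tokens` and `synthesize` callables below are assumed interfaces for illustration, not Ola's actual API.

```python
import re

# Sentence boundary markers (English and CJK punctuation) used to cut the stream.
_SENTENCE_END = re.compile(r"[.!?。！？]")

def stream_speech(prompt, generate_tokens, synthesize):
    """Yield audio chunks sentence by sentence while text decoding continues.

    `generate_tokens(prompt)` is assumed to yield text tokens incrementally;
    `synthesize(text)` is assumed to return an audio chunk for one sentence.
    """
    buffer = ""
    for token in generate_tokens(prompt):      # incremental text decoding
        buffer += token
        if _SENTENCE_END.search(token):        # a sentence boundary was produced
            yield synthesize(buffer.strip())   # start TTS for this sentence now
            buffer = ""
    if buffer.strip():                         # flush any trailing partial sentence
        yield synthesize(buffer.strip())
```

Emitting audio per sentence keeps the time-to-first-sound low, which is the interactive behavior the abstract attributes to GPT-4o-style assistants.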