

LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

September 28, 2025
作者: Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Chunsheng Wu, Huajie Tan, Chunyuan Li, Jing Yang, Jie Yu, Xiyao Wang, Bin Qin, Yumeng Wang, Zizhen Yan, Ziyong Feng, Ziwei Liu, Bo Li, Jiankang Deng
cs.AI

Abstract

We present LLaVA-OneVision-1.5, a novel family of Large Multimodal Models (LMMs) that achieves state-of-the-art performance at significantly reduced computational and financial cost. Unlike existing works, LLaVA-OneVision-1.5 provides an open, efficient, and reproducible framework for building high-quality vision-language models entirely from scratch. The LLaVA-OneVision-1.5 release comprises three primary components: (1) Large-scale curated datasets: we construct an 85M concept-balanced pretraining dataset, LLaVA-OneVision-1.5-Mid-Training, and a meticulously curated 26M instruction dataset, LLaVA-OneVision-1.5-Instruct, collectively encompassing 64B compressed multimodal tokens. (2) Efficient training framework: we develop a complete end-to-end training framework that leverages an offline parallel data-packing strategy, enabling LLaVA-OneVision-1.5 to be trained within a $16,000 budget. (3) State-of-the-art performance: experimental results demonstrate that LLaVA-OneVision-1.5 delivers highly competitive performance across a broad range of downstream tasks. Specifically, LLaVA-OneVision-1.5-8B outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks, and LLaVA-OneVision-1.5-4B surpasses Qwen2.5-VL-3B on all 27 benchmarks. We anticipate releasing LLaVA-OneVision-1.5-RL shortly and encourage the community to await further updates.
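The "offline parallel data packing" mentioned above refers to a family of techniques that concatenate variable-length training samples into fixed-capacity sequences ahead of time, so that little compute is wasted on padding tokens. The paper does not spell out its exact algorithm here, so the following is a minimal illustrative sketch using a standard first-fit-decreasing bin-packing heuristic; all names and the choice of heuristic are assumptions, not the authors' implementation.

```python
def pack_samples(lengths, capacity):
    """Greedy first-fit-decreasing packing (illustrative, not the paper's
    exact method): group sample indices into bins whose total token length
    never exceeds `capacity`, reducing padding waste in each batch."""
    # Process longest samples first; long items are hardest to place.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins, loads = [], []  # bins of sample indices, and their token loads
    for i in order:
        # Place the sample into the first bin with enough remaining room.
        for b, load in enumerate(loads):
            if load + lengths[i] <= capacity:
                bins[b].append(i)
                loads[b] += lengths[i]
                break
        else:
            # No bin fits: open a new one.
            bins.append([i])
            loads.append(lengths[i])
    return bins

# Example: pack samples of these token lengths into 4096-token sequences.
lengths = [3000, 1500, 1200, 900, 600, 400]
bins = pack_samples(lengths, capacity=4096)
# Every sample appears exactly once, and no bin exceeds the capacity.
```

Because the packing runs offline over the whole corpus, it can be parallelized across shards and amortized once before training, rather than paying a per-step padding cost.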