

LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

September 28, 2025
作者: Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Chunsheng Wu, Huajie Tan, Chunyuan Li, Jing Yang, Jie Yu, Xiyao Wang, Bin Qin, Yumeng Wang, Zizhen Yan, Ziyong Feng, Ziwei Liu, Bo Li, Jiankang Deng
cs.AI

Abstract

We present LLaVA-OneVision-1.5, a novel family of Large Multimodal Models (LMMs) that achieves state-of-the-art performance at significantly reduced computational and financial cost. Unlike existing works, LLaVA-OneVision-1.5 provides an open, efficient, and reproducible framework for building high-quality vision-language models entirely from scratch. The LLaVA-OneVision-1.5 release comprises three primary components: (1) Large-scale curated datasets: we construct an 85M concept-balanced pretraining dataset, LLaVA-OneVision-1.5-Mid-Training, and a meticulously curated 26M instruction dataset, LLaVA-OneVision-1.5-Instruct, collectively encompassing 64B compressed multimodal tokens. (2) Efficient training framework: we develop a complete end-to-end training framework that leverages an offline parallel data-packing strategy, enabling LLaVA-OneVision-1.5 to be trained within a $16,000 budget. (3) State-of-the-art performance: experimental results demonstrate that LLaVA-OneVision-1.5 delivers highly competitive performance across a broad range of downstream tasks. Specifically, LLaVA-OneVision-1.5-8B outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks, and LLaVA-OneVision-1.5-4B surpasses Qwen2.5-VL-3B on all 27 benchmarks. We anticipate releasing LLaVA-OneVision-1.5-RL shortly and encourage the community to await further updates.
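The "offline parallel data packing" mentioned above refers to a family of techniques that concatenate variable-length training samples into fixed-capacity sequences ahead of time, so that little compute is wasted on padding tokens. The paper does not spell out its exact algorithm here, so the following is a minimal illustrative sketch using a standard first-fit-decreasing bin-packing heuristic; all names and the choice of heuristic are assumptions, not the authors' implementation.

```python
def pack_samples(lengths, capacity):
    """Greedy first-fit-decreasing packing (illustrative, not the paper's
    exact method): group sample indices into bins whose total token length
    never exceeds `capacity`, reducing padding waste in each batch."""
    # Process longest samples first; long items are hardest to place.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins, loads = [], []  # bins of sample indices, and their token loads
    for i in order:
        # Place the sample into the first bin with enough remaining room.
        for b, load in enumerate(loads):
            if load + lengths[i] <= capacity:
                bins[b].append(i)
                loads[b] += lengths[i]
                break
        else:
            # No bin fits: open a new one.
            bins.append([i])
            loads.append(lengths[i])
    return bins

# Example: pack samples of these token lengths into 4096-token sequences.
lengths = [3000, 1500, 1200, 900, 600, 400]
bins = pack_samples(lengths, capacity=4096)
# Every sample appears exactly once, and no bin exceeds the capacity.
```

Because the packing runs offline over the whole corpus, it can be parallelized across shards and amortized once before training, rather than paying a per-step padding cost.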