

LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

September 28, 2025
作者: Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Chunsheng Wu, Huajie Tan, Chunyuan Li, Jing Yang, Jie Yu, Xiyao Wang, Bin Qin, Yumeng Wang, Zizhen Yan, Ziyong Feng, Ziwei Liu, Bo Li, Jiankang Deng
cs.AI

Abstract

We present LLaVA-OneVision-1.5, a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance with significantly reduced computational and financial costs. Unlike existing work, LLaVA-OneVision-1.5 provides an open, efficient, and reproducible framework for building high-quality vision-language models entirely from scratch. The LLaVA-OneVision-1.5 release comprises three primary components: (1) Large-Scale Curated Datasets: We construct an 85M concept-balanced pretraining dataset, LLaVA-OneVision-1.5-Mid-Training, and a meticulously curated 26M instruction dataset, LLaVA-OneVision-1.5-Instruct, collectively encompassing 64B compressed multimodal tokens. (2) Efficient Training Framework: We develop a complete end-to-end efficient training framework leveraging an offline parallel data packing strategy to facilitate the training of LLaVA-OneVision-1.5 within a $16,000 budget. (3) State-of-the-Art Performance: Experimental results demonstrate that LLaVA-OneVision-1.5 yields exceptionally competitive performance across a broad range of downstream tasks. Specifically, LLaVA-OneVision-1.5-8B outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks, and LLaVA-OneVision-1.5-4B surpasses Qwen2.5-VL-3B on all 27 benchmarks. We anticipate releasing LLaVA-OneVision-1.5-RL shortly and encourage the community to await further updates.
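The abstract credits much of the cost reduction to an offline parallel data packing strategy, i.e., grouping variable-length samples into fixed-capacity training sequences ahead of time so that little compute is wasted on padding. The paper does not give implementation details here, so the following is only a minimal sketch of the general idea using a first-fit-decreasing bin-packing heuristic; the function name and the example capacity are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of offline data packing: greedily grouping
# variable-length samples into bins (training sequences) whose total
# token count never exceeds a fixed capacity. This is a generic
# first-fit-decreasing heuristic, NOT the paper's actual pipeline.

def pack_samples(sample_lengths, capacity):
    """Group sample lengths into bins with totals <= `capacity`.

    Returns a list of bins, each a list of indices into `sample_lengths`.
    """
    # Sort longest-first so large samples claim bins early
    # (first-fit decreasing).
    order = sorted(range(len(sample_lengths)),
                   key=lambda i: -sample_lengths[i])
    bins, loads = [], []
    for i in order:
        length = sample_lengths[i]
        for b, load in enumerate(loads):
            if load + length <= capacity:
                bins[b].append(i)   # fits in an existing bin
                loads[b] += length
                break
        else:
            bins.append([i])        # open a new bin
            loads.append(length)
    return bins

lengths = [60, 40, 30, 20, 10, 4]   # token counts (hypothetical)
packed = pack_samples(lengths, capacity=64)
# Every packed sequence stays within capacity, and padding waste shrinks
# compared with one sample per padded sequence.
assert all(sum(lengths[i] for i in b) <= 64 for b in packed)
```

Because the packing is computed offline, each training step sees near-full sequences, which is where the throughput (and hence budget) gain comes from.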