

LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

September 28, 2025
作者: Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Chunsheng Wu, Huajie Tan, Chunyuan Li, Jing Yang, Jie Yu, Xiyao Wang, Bin Qin, Yumeng Wang, Zizhen Yan, Ziyong Feng, Ziwei Liu, Bo Li, Jiankang Deng
cs.AI

Abstract

We present LLaVA-OneVision-1.5, a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance with significantly reduced computational and financial costs. Unlike existing work, LLaVA-OneVision-1.5 provides an open, efficient, and reproducible framework for building high-quality vision-language models entirely from scratch. The LLaVA-OneVision-1.5 release comprises three primary components: (1) Large-Scale Curated Datasets: We construct an 85M concept-balanced pretraining dataset, LLaVA-OneVision-1.5-Mid-Training, and a meticulously curated 26M instruction dataset, LLaVA-OneVision-1.5-Instruct, collectively encompassing 64B compressed multimodal tokens. (2) Efficient Training Framework: We develop a complete end-to-end efficient training framework leveraging an offline parallel data packing strategy to facilitate the training of LLaVA-OneVision-1.5 within a $16,000 budget. (3) State-of-the-Art Performance: Experimental results demonstrate that LLaVA-OneVision-1.5 yields exceptionally competitive performance across a broad range of downstream tasks. Specifically, LLaVA-OneVision-1.5-8B outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks, and LLaVA-OneVision-1.5-4B surpasses Qwen2.5-VL-3B on all 27 benchmarks. We anticipate releasing LLaVA-OneVision-1.5-RL shortly and encourage the community to await further updates.
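The abstract credits much of the cost reduction to an offline parallel data packing strategy, i.e., grouping variable-length samples into fixed-capacity training sequences ahead of time so that little compute is wasted on padding. The paper does not give implementation details here, so the following is only a minimal sketch of the general idea using a first-fit-decreasing bin-packing heuristic; the function name and the example capacity are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of offline data packing: greedily grouping
# variable-length samples into bins (training sequences) whose total
# token count never exceeds a fixed capacity. This is a generic
# first-fit-decreasing heuristic, NOT the paper's actual pipeline.

def pack_samples(sample_lengths, capacity):
    """Group sample lengths into bins with totals <= `capacity`.

    Returns a list of bins, each a list of indices into `sample_lengths`.
    """
    # Sort longest-first so large samples claim bins early
    # (first-fit decreasing).
    order = sorted(range(len(sample_lengths)),
                   key=lambda i: -sample_lengths[i])
    bins, loads = [], []
    for i in order:
        length = sample_lengths[i]
        for b, load in enumerate(loads):
            if load + length <= capacity:
                bins[b].append(i)   # fits in an existing bin
                loads[b] += length
                break
        else:
            bins.append([i])        # open a new bin
            loads.append(length)
    return bins

lengths = [60, 40, 30, 20, 10, 4]   # token counts (hypothetical)
packed = pack_samples(lengths, capacity=64)
# Every packed sequence stays within capacity, and padding waste shrinks
# compared with one sample per padded sequence.
assert all(sum(lengths[i] for i in b) <= 64 for b in packed)
```

Because the packing is computed offline, each training step sees near-full sequences, which is where the throughput (and hence budget) gain comes from.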