LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

Abstract


We present LLaVA-OneVision-1.5, a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance at significantly reduced computational and financial cost. Unlike existing work, LLaVA-OneVision-1.5 provides an open, efficient, and reproducible framework for building high-quality vision-language models entirely from scratch. The LLaVA-OneVision-1.5 release comprises three primary components: (1) Large-Scale Curated Datasets: We construct an 85M concept-balanced pretraining dataset, LLaVA-OneVision-1.5-Mid-Training, and a meticulously curated 26M instruction dataset, LLaVA-OneVision-1.5-Instruct, which together encompass 64B compressed multimodal tokens. (2) Efficient Training Framework: We develop a complete end-to-end efficient training framework that uses an offline parallel data packing strategy to train LLaVA-OneVision-1.5 within a $16,000 budget. (3) State-of-the-art Performance: Experimental results demonstrate that LLaVA-OneVision-1.5 yields exceptionally competitive performance across a broad range of downstream tasks. Specifically, LLaVA-OneVision-1.5-8B outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks, and LLaVA-OneVision-1.5-4B surpasses Qwen2.5-VL-3B on all 27 benchmarks. We anticipate releasing LLaVA-OneVision-1.5-RL shortly and encourage the community to await further updates.
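
The abstract names an offline data packing strategy but does not describe it. As a rough illustration of the general idea only, the sketch below groups variable-length samples into fixed-capacity sequences before training so that fewer tokens are wasted on padding. It uses a generic first-fit-decreasing heuristic, which is an assumption and not the authors' method; the function name `pack_samples`, the `max_len` parameter, and the sample lengths are all hypothetical.

```python
from typing import List


def pack_samples(lengths: List[int], max_len: int) -> List[List[int]]:
    """Group sample indices into bins whose total token count is <= max_len.

    Generic first-fit-decreasing heuristic (an illustrative assumption,
    not the packing algorithm used by LLaVA-OneVision-1.5).
    """
    # Visit samples from longest to shortest.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins: List[List[int]] = []  # sample indices per packed sequence
    free: List[int] = []        # remaining token capacity per packed sequence
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sample {i} has {n} tokens, more than max_len={max_len}")
        # Place the sample in the first bin with enough room.
        for b, room in enumerate(free):
            if n <= room:
                bins[b].append(i)
                free[b] -= n
                break
        else:
            # No existing bin fits, so open a new one.
            bins.append([i])
            free.append(max_len - n)
    return bins


# Hypothetical token counts for six samples.
lengths = [900, 300, 700, 100, 1000, 50]
packed = pack_samples(lengths, max_len=1024)
```

Because the grouping depends only on sample lengths, it can be computed once ahead of training (hence "offline"), and disjoint shards of the dataset could be packed independently in parallel.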
