LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

Abstract

We present LLaVA-OneVision-1.5, a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance at significantly reduced computational and financial cost. Unlike existing work, LLaVA-OneVision-1.5 provides an open, efficient, and reproducible framework for building high-quality vision-language models entirely from scratch. The LLaVA-OneVision-1.5 release comprises three primary components: (1) Large-Scale Curated Datasets: We construct an 85M concept-balanced pretraining dataset, LLaVA-OneVision-1.5-Mid-Training, and a meticulously curated 26M instruction dataset, LLaVA-OneVision-1.5-Instruct, which together comprise 64B compressed multimodal tokens. (2) Efficient Training Framework: We develop a complete end-to-end training framework that uses an offline parallel data packing strategy, making it possible to train LLaVA-OneVision-1.5 within a $16,000 budget. (3) State-of-the-Art Performance: Experimental results show that LLaVA-OneVision-1.5 is highly competitive across a broad range of downstream tasks. Specifically, LLaVA-OneVision-1.5-8B outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks, and LLaVA-OneVision-1.5-4B surpasses Qwen2.5-VL-3B on all 27 benchmarks. We anticipate releasing LLaVA-OneVision-1.5-RL shortly and encourage the community to watch for further updates.
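The abstract names an offline data packing strategy but does not describe its algorithm. As a generic illustration only, the sketch below shows one common way to pack variable-length samples into fixed-length training sequences ahead of time: first-fit-decreasing bin packing. The function names, the 1024-token budget, and the sample lengths are all hypothetical, the code is single-process rather than parallel, and nothing here should be read as the authors' implementation.

```python
from typing import List, Sequence


def pack_first_fit_decreasing(lengths: Sequence[int], max_len: int) -> List[List[int]]:
    """Group sample indices into bins whose total token count is <= max_len.

    Generic first-fit-decreasing heuristic (an assumption for illustration,
    not the method used by LLaVA-OneVision-1.5).
    """
    if any(n > max_len for n in lengths):
        raise ValueError("a sample is longer than max_len; truncate or drop it first")

    # Visit the longest samples first so shorter ones can fill leftover space.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)

    bins: List[List[int]] = []   # sample indices per packed sequence
    remaining: List[int] = []    # free token slots per packed sequence
    for i in order:
        for b, free in enumerate(remaining):
            if lengths[i] <= free:
                bins[b].append(i)
                remaining[b] -= lengths[i]
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([i])
            remaining.append(max_len - lengths[i])
    return bins


def padding_ratio(total_tokens: int, num_sequences: int, max_len: int) -> float:
    """Fraction of token slots that hold padding instead of real tokens."""
    return 1.0 - total_tokens / (num_sequences * max_len)


if __name__ == "__main__":
    # Hypothetical token counts for seven multimodal samples.
    sample_lengths = [700, 300, 512, 200, 512, 1000, 24]
    budget = 1024

    packed = pack_first_fit_decreasing(sample_lengths, budget)
    total = sum(sample_lengths)

    print("packed bins:", packed)
    print("padding without packing:", round(padding_ratio(total, len(sample_lengths), budget), 3))
    print("padding with packing:", round(padding_ratio(total, len(packed), budget), 3))
```

Because the packing runs before training starts, the training loop only ever sees sequences that are close to full, which is the usual motivation for doing this offline. A real pipeline would also need to keep attention from crossing sample boundaries inside a packed sequence, which this sketch does not address.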

