Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources

Abstract

Reproducing state-of-the-art multimodal LLM pre-training faces barriers at every stage of the pipeline, including high-quality data filtering, multimodal data mixture strategies, sequence packing techniques, and training frameworks. We introduce Open-Qwen2VL, a fully open-source 2B-parameter multimodal large language model (MLLM) pre-trained efficiently on 29M image-text pairs using only 442 A100-40G GPU hours. Our approach employs low-to-high dynamic image resolution and multimodal sequence packing to significantly improve pre-training efficiency. The training dataset was carefully curated using both MLLM-based filtering techniques (e.g., MLM-Filter) and conventional CLIP-based filtering methods, substantially improving data quality and training efficiency. Pre-training was conducted at academic scale, on 8xA100-40G GPUs at UCSB, over 5B packed multimodal tokens, which is 0.36% of the 1.4T multimodal pre-training tokens of Qwen2-VL. The final instruction-tuned Open-Qwen2VL outperforms the partially open state-of-the-art MLLM Qwen2-VL-2B on several multimodal benchmarks, including MMBench, SEEDBench, MMStar, and MathVista, indicating the remarkable training efficiency of Open-Qwen2VL. We open-source all aspects of our work, including compute-efficient and data-efficient training details, data filtering methods, sequence packing scripts, pre-training data in WebDataset format, the FSDP-based training codebase, and both base and instruction-tuned model checkpoints. We redefine "fully open" for multimodal LLMs as the complete release of: 1) the training codebase, 2) detailed data filtering techniques, and 3) all pre-training and supervised fine-tuning data used to develop the model.
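
The compute figures quoted in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below is only an illustration: the constant and function names are hypothetical, and the wall-clock estimate assumes that all 8 GPUs were busy for the entire run, which the abstract does not state.

```python
# Figures taken from the abstract.
GPU_HOURS = 442            # total A100-40G GPU hours for pre-training
NUM_GPUS = 8               # 8xA100-40G
PACKED_TOKENS = 5e9        # 5B packed multimodal tokens
QWEN2_VL_TOKENS = 1.4e12   # 1.4T multimodal pre-training tokens of Qwen2-VL


def token_fraction_percent(tokens: float, reference_tokens: float) -> float:
    """Return `tokens` as a percentage of `reference_tokens`."""
    return 100.0 * tokens / reference_tokens


def wall_clock_hours(gpu_hours: float, num_gpus: int) -> float:
    """Estimate wall-clock time.

    Assumption (not stated in the abstract): every GPU is in use
    for the full duration of the run.
    """
    return gpu_hours / num_gpus


fraction = token_fraction_percent(PACKED_TOKENS, QWEN2_VL_TOKENS)
hours = wall_clock_hours(GPU_HOURS, NUM_GPUS)

print(f"Token budget relative to Qwen2-VL: {fraction:.2f}%")
print(f"Estimated wall-clock time on {NUM_GPUS} GPUs: {hours:.2f} hours")
```

Under these assumptions the token ratio rounds to the 0.36% reported above, and 442 GPU hours on 8 GPUs corresponds to roughly 55 hours of wall-clock time.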
