Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources

Abstract

Reproducing state-of-the-art multimodal LLM pre-training faces barriers at every stage of the pipeline, including high-quality data filtering, multimodal data mixture strategies, sequence packing techniques, and training frameworks. We introduce Open-Qwen2VL, a fully open-source 2B-parameter Multimodal Large Language Model pre-trained efficiently on 29M image-text pairs using only 442 A100-40G GPU hours. Our approach employs low-to-high dynamic image resolution and multimodal sequence packing to significantly enhance pre-training efficiency. The training dataset was carefully curated using both MLLM-based filtering techniques (e.g., MLM-Filter) and conventional CLIP-based filtering methods, substantially improving data quality and training efficiency. Pre-training is conducted on academic-scale hardware, 8xA100-40G GPUs at UCSB, on 5B packed multimodal tokens, which is 0.36% of the 1.4T multimodal pre-training tokens of Qwen2-VL. The final instruction-tuned Open-Qwen2VL outperforms the partially open state-of-the-art MLLM Qwen2-VL-2B on the multimodal benchmarks MMBench, SEEDBench, MMStar, and MathVista, indicating the remarkable training efficiency of Open-Qwen2VL. We open-source all aspects of our work, including compute-efficient and data-efficient training details, data filtering methods, sequence packing scripts, pre-training data in WebDataset format, the FSDP-based training codebase, and both base and instruction-tuned model checkpoints. We redefine "fully open" for multimodal LLMs as the complete release of: 1) the training codebase, 2) detailed data filtering techniques, and 3) all pre-training and supervised fine-tuning data used to develop the model.
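To make the idea of sequence packing concrete, here is a minimal sketch of one common way to do it: a first-fit-decreasing heuristic that groups variable-length samples into fixed-size token budgets so that little of each training sequence is wasted on padding. The abstract does not say which packing algorithm the released scripts use, so the heuristic, the function name `pack_sequences`, the `max_len` value, and the toy token counts are all assumptions chosen for illustration.

```python
from typing import List


def pack_sequences(lengths: List[int], max_len: int) -> List[List[int]]:
    """Group sample indices into bins whose total token count is <= max_len.

    First-fit-decreasing heuristic: visit samples from longest to shortest and
    place each one into the first bin that still has room.
    """
    # Assumed policy: reject samples that cannot fit into any bin.
    for i, n in enumerate(lengths):
        if n > max_len:
            raise ValueError(f"sample {i} has {n} tokens, exceeds max_len={max_len}")

    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins: List[List[int]] = []  # sample indices per packed sequence
    free: List[int] = []        # remaining token budget per packed sequence

    for i in order:
        for b in range(len(bins)):
            if lengths[i] <= free[b]:
                bins[b].append(i)
                free[b] -= lengths[i]
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([i])
            free.append(max_len - lengths[i])
    return bins


# Toy per-sample token counts (image tokens plus caption tokens); illustrative only.
lengths = [300, 120, 700, 90, 512, 256]
packed = pack_sequences(lengths, max_len=1024)

# Fraction of the packed token budget occupied by real tokens rather than padding.
fill = sum(lengths) / (len(packed) * 1024)
```

In this toy case the six samples fit into two packed sequences instead of six padded ones. A real multimodal pipeline would also need to keep each image next to its caption and adjust attention masks or position IDs at sample boundaries, which this sketch leaves out.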
