Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources
April 1, 2025
Authors: Weizhi Wang, Yu Tian, Linjie Yang, Heng Wang, Xifeng Yan
cs.AI
Abstract
The reproduction of state-of-the-art multimodal LLM pre-training faces
barriers at every stage of the pipeline, including high-quality data filtering,
multimodal data mixture strategies, sequence packing techniques, and training
frameworks. We introduce Open-Qwen2VL, a fully open-source 2B-parameter
Multimodal Large Language Model pre-trained efficiently on 29M image-text pairs
using only 442 A100-40G GPU hours. Our approach employs low-to-high dynamic
image resolution and multimodal sequence packing to significantly enhance
pre-training efficiency. The training dataset was carefully curated using both
MLLM-based filtering techniques (e.g., MLM-Filter) and conventional CLIP-based
filtering methods, substantially improving data quality and training
efficiency. The Open-Qwen2VL pre-training is conducted on academic-level
8xA100-40G GPUs at UCSB on 5B packed multimodal tokens, which is only 0.36% of
the 1.4T multimodal pre-training tokens of Qwen2-VL. The final instruction-tuned
Open-Qwen2VL outperforms the partially-open state-of-the-art MLLM Qwen2-VL-2B on
various multimodal benchmarks, including MMBench, SEEDBench, MMStar, and MathVista,
indicating the remarkable training efficiency of Open-Qwen2VL. We open-source
all aspects of our work, including compute-efficient and data-efficient
training details, data filtering methods, sequence packing scripts,
pre-training data in WebDataset format, FSDP-based training codebase, and both
base and instruction-tuned model checkpoints. We redefine "fully open" for
multimodal LLMs as the complete release of: 1) the training codebase, 2)
detailed data filtering techniques, and 3) all pre-training and supervised
fine-tuning data used to develop the model.
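The abstract credits much of the efficiency gain to multimodal sequence packing. As a point of reference, here is a minimal, hypothetical sketch of the underlying greedy first-fit-decreasing bin-packing idea: concatenating variable-length image-text samples into fixed-length training sequences so that few positions are wasted on padding. The function name `pack_sequences` and the `max_len` value are illustrative, not the paper's released scripts.

```python
# Minimal sketch of greedy first-fit-decreasing multimodal sequence packing.
# Hypothetical helper, not the released Open-Qwen2VL packing script: each
# sample is a list of token ids (image patch tokens already interleaved with
# text tokens); samples are packed into bins of at most `max_len` tokens.
from typing import List

def pack_sequences(samples: List[List[int]], max_len: int = 4096) -> List[List[int]]:
    bins: List[List[int]] = []   # packed sequences under construction
    space: List[int] = []        # remaining capacity of each bin
    # Longest-first ordering tends to leave fewer unfillable gaps.
    for sample in sorted(samples, key=len, reverse=True):
        if len(sample) > max_len:        # truncate rare over-long samples
            sample = sample[:max_len]
        for i, free in enumerate(space):
            if len(sample) <= free:      # first bin with enough room
                bins[i].extend(sample)
                space[i] -= len(sample)
                break
        else:                            # no existing bin fits: open a new one
            bins.append(list(sample))
            space.append(max_len - len(sample))
    return bins
```

In practice a packer also records per-sample boundaries so the attention mask can block cross-sample attention; that bookkeeping is omitted here.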
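"Conventional CLIP-based filtering," as mentioned in the abstract, means scoring each image-text pair by CLIP cosine similarity and keeping pairs above a threshold. A hedged sketch using the `open_clip` package follows; the ViT-B-32 backbone, pretrained tag, and 0.28 threshold are illustrative assumptions, not the paper's reported configuration.

```python
# Sketch of CLIP-score filtering for image-text pairs, assuming the open_clip
# package; the backbone and the 0.28 threshold are illustrative choices.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

def clip_score(image_path: str, caption: str) -> float:
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer([caption])
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
        img_feat /= img_feat.norm(dim=-1, keepdim=True)
        txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).item()  # cosine similarity

def keep(image_path: str, caption: str, threshold: float = 0.28) -> bool:
    return clip_score(image_path, caption) >= threshold
```

The MLLM-based MLM-Filter scoring mentioned alongside this works differently (a multimodal LLM grades caption quality), but the keep/drop thresholding step is analogous.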
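Since the pre-training data is released in WebDataset format, a short sketch of how such shards are typically streamed may help; it assumes the `webdataset` package, and the shard filename pattern is hypothetical rather than the released layout.

```python
# Sketch of streaming image-text pairs from WebDataset tar shards, assuming
# the webdataset package; the shard URL pattern below is illustrative.
import webdataset as wds

dataset = (
    wds.WebDataset("data/shard-{00000..00028}.tar")  # hypothetical shard names
    .decode("pil")               # decode .jpg entries to PIL images
    .to_tuple("jpg", "txt")      # yield (image, caption) pairs
)

for image, caption in dataset:
    pass  # feed into the filtering / packing pipeline
```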