Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources
April 1, 2025
Authors: Weizhi Wang, Yu Tian, Linjie Yang, Heng Wang, Xifeng Yan
cs.AI
Abstract
The reproduction of state-of-the-art multimodal LLM pre-training faces
barriers at every stage of the pipeline, including high-quality data filtering,
multimodal data mixture strategies, sequence packing techniques, and training
frameworks. We introduce Open-Qwen2VL, a fully open-source 2B-parameter
Multimodal Large Language Model pre-trained efficiently on 29M image-text pairs
using only 442 A100-40G GPU hours. Our approach employs low-to-high dynamic
image resolution and multimodal sequence packing to significantly enhance
pre-training efficiency. The training dataset was carefully curated using both
MLLM-based filtering techniques (e.g., MLM-Filter) and conventional CLIP-based
filtering methods, substantially improving data quality and training
efficiency. The Open-Qwen2VL pre-training is conducted on academic-level
8xA100-40G GPUs at UCSB on 5B packed multimodal tokens, which is only 0.36% of
the 1.4T multimodal pre-training tokens of Qwen2-VL. The final instruction-tuned
Open-Qwen2VL outperforms the partially-open state-of-the-art MLLM Qwen2-VL-2B on
various multimodal benchmarks, including MMBench, SEEDBench, MMStar, and MathVista,
indicating the remarkable training efficiency of Open-Qwen2VL. We open-source
all aspects of our work, including compute-efficient and data-efficient
training details, data filtering methods, sequence packing scripts,
pre-training data in WebDataset format, FSDP-based training codebase, and both
base and instruction-tuned model checkpoints. We redefine "fully open" for
multimodal LLMs as the complete release of: 1) the training codebase, 2)
detailed data filtering techniques, and 3) all pre-training and supervised
fine-tuning data used to develop the model.
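The abstract credits much of the efficiency gain to multimodal sequence packing. As a point of reference, here is a minimal, hypothetical sketch of the underlying greedy first-fit-decreasing bin-packing idea: concatenating variable-length image-text samples into fixed-length training sequences so that few positions are wasted on padding. The function name `pack_sequences` and the `max_len` value are illustrative, not the paper's released scripts.

```python
# Minimal sketch of greedy first-fit-decreasing multimodal sequence packing.
# Hypothetical helper, not the released Open-Qwen2VL packing script: each
# sample is a list of token ids (image patch tokens already interleaved with
# text tokens); samples are packed into bins of at most `max_len` tokens.
from typing import List

def pack_sequences(samples: List[List[int]], max_len: int = 4096) -> List[List[int]]:
    bins: List[List[int]] = []   # packed sequences under construction
    space: List[int] = []        # remaining capacity of each bin
    # Longest-first ordering tends to leave fewer unfillable gaps.
    for sample in sorted(samples, key=len, reverse=True):
        if len(sample) > max_len:        # truncate rare over-long samples
            sample = sample[:max_len]
        for i, free in enumerate(space):
            if len(sample) <= free:      # first bin with enough room
                bins[i].extend(sample)
                space[i] -= len(sample)
                break
        else:                            # no existing bin fits: open a new one
            bins.append(list(sample))
            space.append(max_len - len(sample))
    return bins
```

In practice a packer also records per-sample boundaries so the attention mask can block cross-sample attention; that bookkeeping is omitted here.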
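"Conventional CLIP-based filtering," as mentioned in the abstract, means scoring each image-text pair by CLIP cosine similarity and keeping pairs above a threshold. A hedged sketch using the `open_clip` package follows; the ViT-B-32 backbone, pretrained tag, and 0.28 threshold are illustrative assumptions, not the paper's reported configuration.

```python
# Sketch of CLIP-score filtering for image-text pairs, assuming the open_clip
# package; the backbone and the 0.28 threshold are illustrative choices.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

def clip_score(image_path: str, caption: str) -> float:
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer([caption])
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
        img_feat /= img_feat.norm(dim=-1, keepdim=True)
        txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).item()  # cosine similarity

def keep(image_path: str, caption: str, threshold: float = 0.28) -> bool:
    return clip_score(image_path, caption) >= threshold
```

The MLLM-based MLM-Filter scoring mentioned alongside this works differently (a multimodal LLM grades caption quality), but the keep/drop thresholding step is analogous.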
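Since the pre-training data is released in WebDataset format, a short sketch of how such shards are typically streamed may help; it assumes the `webdataset` package, and the shard filename pattern is hypothetical rather than the released layout.

```python
# Sketch of streaming image-text pairs from WebDataset tar shards, assuming
# the webdataset package; the shard URL pattern below is illustrative.
import webdataset as wds

dataset = (
    wds.WebDataset("data/shard-{00000..00028}.tar")  # hypothetical shard names
    .decode("pil")               # decode .jpg entries to PIL images
    .to_tuple("jpg", "txt")      # yield (image, caption) pairs
)

for image, caption in dataset:
    pass  # feed into the filtering / packing pipeline
```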