

Train a Unified Multimodal Data Quality Classifier with Synthetic Data

October 16, 2025
作者: Weizhi Wang, Rongmei Lin, Shiyang Li, Colin Lockard, Ritesh Sarkhel, Sanket Lokegaonkar, Jingbo Shang, Xifeng Yan, Nasser Zalmout, Xian Li
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) are continually pre-trained on a mixture of image-text caption data and interleaved document data, yet high-quality data filtering for image-text interleaved document data remains under-explored. We propose to train an efficient MLLM as a Unified Multimodal Data Quality Classifier (UniFilter) that filters both high-quality image-text caption data and interleaved data. To address the challenge of collecting diverse labeled multimodal data, we introduce a semi-synthetic approach that leverages readily available raw images and generates corresponding text across four quality levels. This method enables efficient creation of sample-score pairs for both caption and interleaved document data to train UniFilter. We apply UniFilter to curate high-quality caption data from the DataComp caption dataset and interleaved data from the OBELICS image-text interleaved dataset. MLLMs pre-trained on the filtered data demonstrate significantly enhanced capabilities compared to those trained on baseline-filtered data, achieving stronger zero-shot reasoning and in-context learning. After visual supervised fine-tuning, these UniFilter-induced MLLMs achieve stronger performance on various benchmarks, highlighting the downstream benefits of high-quality multimodal pre-training. We release the synthetic training data used for training UniFilter, the UniFilter model checkpoints, and the high-quality interleaved document subset OBELICS-HQ, curated by UniFilter, to the community for reproduction and further development.
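The abstract's pipeline can be sketched at a high level: map the four synthetic quality levels to scores, build sample-score pairs for classifier training, and keep only the samples whose predicted score clears a threshold. The sketch below is illustrative only; the level names, score mapping, and thresholds are hypothetical assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a UniFilter-style quality-filtering pipeline.
# Quality-level names and their score mapping are illustrative assumptions.

# Four text quality levels, as described in the abstract, mapped to scores.
QUALITY_SCORES = {"excellent": 3, "good": 2, "fair": 1, "poor": 0}

def make_sample_score_pairs(samples):
    """Turn (text, quality_level) pairs into (text, score) training pairs
    for the quality classifier."""
    return [(text, QUALITY_SCORES[level]) for text, level in samples]

def filter_by_score(samples, score_fn, threshold=2):
    """Keep only samples whose classifier-assigned score meets the
    threshold; score_fn stands in for the trained classifier."""
    return [s for s in samples if score_fn(s) >= threshold]

if __name__ == "__main__":
    # Semi-synthetic training pairs (toy examples).
    synthetic = [("a detailed, grounded caption", "excellent"),
                 ("unrelated noisy text", "poor")]
    pairs = make_sample_score_pairs(synthetic)
    print(pairs)
```

In the real system the classifier is itself an MLLM scoring image-text inputs; here a plain callable stands in for it so the filtering step is easy to see.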