Train a Unified Multimodal Data Quality Classifier with Synthetic Data
October 16, 2025
Authors: Weizhi Wang, Rongmei Lin, Shiyang Li, Colin Lockard, Ritesh Sarkhel, Sanket Lokegaonkar, Jingbo Shang, Xifeng Yan, Nasser Zalmout, Xian Li
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) are continually pre-trained on a
mixture of image-text caption data and interleaved document data, yet
high-quality data filtering for image-text interleaved document data remains
under-explored. We propose training an efficient MLLM as a Unified Multimodal
Data Quality Classifier (UniFilter) that filters both high-quality image-text
caption data and interleaved data. To address the challenge of collecting diverse
labeled multimodal data, we introduce a semi-synthetic approach that leverages
readily available raw images and generates corresponding text across four
quality levels. This method enables efficient creation of sample-score pairs
for both caption and interleaved document data to train UniFilter. We apply
UniFilter to curate high-quality caption data from the DataComp caption dataset and
interleaved data from the OBELICS image-text interleaved dataset. MLLMs
pre-trained on the filtered data demonstrate significantly enhanced
capabilities compared to those trained on baseline-filtered data, with notably
stronger zero-shot reasoning and in-context learning. After visual
supervised fine-tuning, these MLLMs pre-trained on UniFilter-curated data achieve stronger
performance on various benchmarks, highlighting the downstream benefits of
high-quality multimodal pre-training. We release the synthetic training data
used for training UniFilter, the UniFilter model checkpoints, and the
high-quality interleaved document subset OBELICS-HQ, curated by UniFilter, to
the community for reproduction and further development.
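
To make the filtering step concrete, below is a minimal sketch of how a UniFilter-style quality classifier could be applied to score and filter a data pool. All names here (`Sample`, `filter_by_quality`, `toy_score`), the 0-3 quality scale, and the keep-threshold are illustrative assumptions for this sketch, not the released UniFilter API; a real pipeline would replace the stand-in scorer with the released model checkpoint.

```python
"""Sketch of threshold-based quality filtering, in the style UniFilter describes.

Assumptions: samples carry a quality score on a 4-level (0-3) scale, and
filtering keeps samples whose predicted score clears a chosen threshold.
"""

from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class Sample:
    # A caption pair or an interleaved document, reduced to image refs + text.
    image_paths: List[str] = field(default_factory=list)
    text: str = ""


def filter_by_quality(
    samples: Iterable[Sample],
    score_fn: Callable[[Sample], float],  # e.g. a trained MLLM classifier head
    threshold: float = 2.0,               # assumed cut on the 0-3 quality scale
) -> List[Sample]:
    """Keep samples whose predicted quality score clears the threshold."""
    return [s for s in samples if score_fn(s) >= threshold]


def toy_score(sample: Sample) -> float:
    # Hypothetical heuristic stand-in so the sketch runs end to end;
    # the actual method scores samples with the trained UniFilter model.
    return 3.0 if len(sample.text.split()) > 5 else 1.0


if __name__ == "__main__":
    pool = [
        Sample(["img_0.jpg"], "A dog."),
        Sample(["img_1.jpg"], "A golden retriever fetching a ball in a park."),
    ]
    kept = filter_by_quality(pool, toy_score)
    print(f"kept {len(kept)} of {len(pool)} samples")
```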