Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters

Abstract

We propose a novel framework for filtering image-text data by leveraging fine-tuned Multimodal Language Models (MLMs). Our approach outperforms predominant filtering methods (e.g., CLIPScore) by integrating recent advances in MLMs. We design four distinct yet complementary metrics to holistically measure the quality of image-text data. We establish a new pipeline to construct high-quality instruction data for fine-tuning MLMs as data filters. Compared with CLIPScore, our MLM filters produce more precise and comprehensive scores that directly improve the quality of the filtered data and boost the performance of pre-trained models. We achieve significant improvements over CLIPScore on popular foundation models (i.e., CLIP and BLIP2) and various downstream tasks. Our MLM filter generalizes to different models and tasks and can be used as a drop-in replacement for CLIPScore. An additional ablation study verifies our design choices for the MLM filter.
