Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters

Abstract

We propose a novel framework for filtering image-text data by leveraging fine-tuned Multimodal Language Models (MLMs). By integrating recent advances in MLMs, our approach outperforms predominant filtering methods (e.g., CLIPScore). We design four distinct yet complementary metrics to holistically measure the quality of image-text data. We establish a new pipeline to construct high-quality instruction data for fine-tuning MLMs as data filters. Compared with CLIPScore, our MLM filters produce more precise and comprehensive scores that directly improve the quality of the filtered data and boost the performance of pre-trained models. We achieve significant improvements over CLIPScore on popular foundation models (i.e., CLIP and BLIP2) and various downstream tasks. Our MLM filter generalizes to different models and tasks and can be used as a drop-in replacement for CLIPScore. An additional ablation study is provided to verify our design choices for the MLM filter.
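To make the "drop-in replacement" idea concrete, here is a minimal sketch of score-based filtering in which the scoring function is a swappable argument. The abstract does not state the selection rule, so keeping a top fraction of pairs is an assumption, and every name here (`Pair`, `filter_top_fraction`, `toy_score`, `keep_fraction`) is hypothetical and not taken from the paper.

```python
from typing import Callable, List, Tuple

# Hypothetical representation of one sample: (image_path, caption).
Pair = Tuple[str, str]


def filter_top_fraction(
    pairs: List[Pair],
    score_fn: Callable[[Pair], float],
    keep_fraction: float = 0.3,
) -> List[Pair]:
    """Keep the highest-scoring fraction of image-text pairs.

    score_fn is any quality scorer (for example a CLIPScore-style function
    or an MLM-based one); swapping it does not change the filtering code.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not pairs:
        return []
    # Sort from highest to lowest score, then keep the leading slice.
    ranked = sorted(pairs, key=score_fn, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


def toy_score(pair: Pair) -> float:
    # Stand-in scorer for illustration only: longer captions score higher.
    # This is NOT a real quality metric.
    return float(len(pair[1]))
```

Because the scorer is passed in as `score_fn`, replacing one quality score with another only changes that argument, not the rest of the pipeline.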
