What If We Recaption Billions of Web Images with LLaMA-3?
June 12, 2024
Authors: Xianhang Li, Haoqin Tu, Mude Hui, Zeyu Wang, Bingchen Zhao, Junfei Xiao, Sucheng Ren, Jieru Mei, Qing Liu, Huangjie Zheng, Yuyin Zhou, Cihang Xie
cs.AI
Abstract
Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate
that semantically aligning and enriching textual descriptions of these pairs
can significantly enhance model training across various vision-language tasks,
particularly text-to-image generation. However, large-scale investigations in
this area remain predominantly closed-source. Our paper aims to bridge this
community effort, leveraging the powerful and open-sourced LLaMA-3, a
GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a
LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption 1.3 billion images
from the DataComp-1B dataset. Our empirical results confirm that this enhanced
dataset, Recap-DataComp-1B, offers substantial benefits in training advanced
vision-language models. For discriminative models like CLIP, we observe
enhanced zero-shot performance in cross-modal retrieval tasks. For generative
models like text-to-image Diffusion Transformers, the generated images exhibit
a significant improvement in alignment with users' text instructions,
especially in following complex queries. Our project page is
https://www.haqtu.me/Recap-Datacomp-1B/
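To make the recaptioning step concrete, below is a minimal single-image sketch of running a LLaVA-style captioner with Hugging Face Transformers. It is an illustration only: the checkpoint ID, prompt template, and generation settings are assumptions standing in for the authors' fine-tuned LLaMA-3-8B powered LLaVA-1.5, which they apply at scale to the 1.3 billion DataComp-1B images.

```python
# Minimal recaptioning sketch with Hugging Face Transformers.
# NOTE (assumptions): the checkpoint below is the public LLaVA-1.5-7B release,
# used only as a stand-in for the paper's LLaMA-3-8B powered LLaVA-1.5 captioner;
# the prompt and generation settings are likewise illustrative, not the authors'.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # placeholder captioner checkpoint

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def recaption(image: Image.Image) -> str:
    """Generate a detailed caption for a single image."""
    prompt = "USER: <image>\nDescribe the image in detail. ASSISTANT:"
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(
        model.device, torch.float16
    )
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    decoded = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
    return decoded.split("ASSISTANT:")[-1].strip()

# Replace with a path to any local image.
print(recaption(Image.open("example.jpg").convert("RGB")))
```

In practice, recaptioning at billion-image scale would use batched, mixed-precision inference sharded across many GPUs; the single-image loop above only illustrates the input/output contract of the recaptioner.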