What If We Recaption Billions of Web Images with LLaMA-3?
June 12, 2024
Authors: Xianhang Li, Haoqin Tu, Mude Hui, Zeyu Wang, Bingchen Zhao, Junfei Xiao, Sucheng Ren, Jieru Mei, Qing Liu, Huangjie Zheng, Yuyin Zhou, Cihang Xie
cs.AI
Abstract
Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate
that semantically aligning and enriching textual descriptions of these pairs
can significantly enhance model training across various vision-language tasks,
particularly text-to-image generation. However, large-scale investigations in
this area remain predominantly closed-source. Our paper aims to bridge this
gap for the community by leveraging the powerful, open-source LLaMA-3, a
GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a
LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption 1.3 billion images
from the DataComp-1B dataset. Our empirical results confirm that this enhanced
dataset, Recap-DataComp-1B, offers substantial benefits in training advanced
vision-language models. For discriminative models like CLIP, we observe
enhanced zero-shot performance in cross-modal retrieval tasks. For generative
models like text-to-image Diffusion Transformers, the generated images exhibit
a significant improvement in alignment with users' text instructions,
especially in following complex queries. Our project page is
https://www.haqtu.me/Recap-Datacomp-1B/
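The recaptioning step described in the abstract (run a fine-tuned, LLaMA-3-8B-powered LLaVA-1.5 over each image to produce a richer caption) could be sketched as follows. This is a minimal illustration, not the authors' released code: the model identifier and prompt template below are placeholder assumptions and should be replaced with the checkpoint and template actually published by the project.

```python
# Minimal recaptioning sketch using Hugging Face Transformers.
# MODEL_ID is a hypothetical identifier, not the authors' released checkpoint.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "your-org/llava-llama-3-8b-recaptioner"  # placeholder assumption

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def recaption(image_path: str) -> str:
    """Generate a dense caption for a single image."""
    image = Image.open(image_path).convert("RGB")
    # LLaVA-1.5-style chat prompt; the exact template used in the paper may differ.
    prompt = "USER: <image>\nPlease describe this image in detail. ASSISTANT:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    text = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
    return text.split("ASSISTANT:")[-1].strip()

print(recaption("example.jpg"))
```

In practice the released pipeline processes 1.3 billion DataComp-1B images, so a batched, sharded variant of this loop would be needed; the single-image function above only illustrates the per-sample inference call.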