BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions

November 12, 2024
Authors: Anas Awadalla, Le Xue, Manli Shu, An Yan, Jun Wang, Senthil Purushwalkam, Sheng Shen, Hannah Lee, Oscar Lo, Jae Sung Park, Etash Guha, Silvio Savarese, Ludwig Schmidt, Yejin Choi, Caiming Xiong, Ran Xu
cs.AI

Abstract

We introduce BLIP3-KALE, a dataset of 218 million image-text pairs that bridges the gap between descriptive synthetic captions and factual web-scale alt-text. KALE augments synthetic dense image captions with web-scale alt-text to generate factually grounded image captions. Our two-stage approach leverages large vision-language models and language models to create knowledge-augmented captions, which are then used to train a specialized VLM for scaling up the dataset. We train vision-language models on KALE and demonstrate improvements on vision-language tasks. Our experiments show the utility of KALE for training more capable and knowledgeable multimodal models. We release the KALE dataset at https://huggingface.co/datasets/Salesforce/blip3-kale.
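The abstract describes a two-stage pipeline in which an LLM fuses a VLM-generated dense caption with the image's web alt-text. The sketch below illustrates that Stage-1 idea under stated assumptions; it is not the authors' released code, and the `dense_caption_model` and `llm` callables as well as the prompt wording are hypothetical placeholders.

```python
# Illustrative sketch of the Stage-1 recipe from the abstract: an LLM merges a
# synthetic dense caption with web alt-text into a knowledge-augmented caption.
# `dense_caption_model` and `llm` are hypothetical placeholder callables.

def knowledge_augmented_caption(image, alt_text, dense_caption_model, llm):
    # A large VLM writes a detailed, descriptive caption of the image.
    dense_caption = dense_caption_model(image)

    # An LLM fuses the descriptive caption with the factual alt-text, grounding
    # the description in real-world knowledge (named entities, places, events).
    prompt = (
        "Combine the image description and the alt-text into a single caption "
        "that is both descriptive and factually grounded.\n"
        f"Description: {dense_caption}\n"
        f"Alt-text: {alt_text}"
    )
    return llm(prompt)
```

Since the release is hosted on the Hugging Face Hub, it can be inspected with the `datasets` library. A minimal sketch follows, assuming the release exposes a `train` split; streaming avoids downloading all 218 million pairs, and printing a few records reveals the actual schema rather than guessing column names.

```python
from datasets import load_dataset

# Stream the dataset to avoid materializing 218M image-text pairs locally.
# The "train" split name is an assumption about the release layout.
kale = load_dataset("Salesforce/blip3-kale", split="train", streaming=True)

for i, record in enumerate(kale):
    print(record)  # inspect a few records to see the actual fields
    if i >= 2:
        break
```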
