BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions

November 12, 2024
Authors: Anas Awadalla, Le Xue, Manli Shu, An Yan, Jun Wang, Senthil Purushwalkam, Sheng Shen, Hannah Lee, Oscar Lo, Jae Sung Park, Etash Guha, Silvio Savarese, Ludwig Schmidt, Yejin Choi, Caiming Xiong, Ran Xu
cs.AI

Abstract

We introduce BLIP3-KALE, a dataset of 218 million image-text pairs that bridges the gap between descriptive synthetic captions and factual web-scale alt-text. KALE augments synthetic dense image captions with web-scale alt-text to generate factually grounded image captions. Our two-stage approach leverages large vision-language models and language models to create knowledge-augmented captions, which are then used to train a specialized VLM for scaling up the dataset. We train vision-language models on KALE and demonstrate improvements on vision-language tasks. Our experiments show the utility of KALE for training more capable and knowledgeable multimodal models. We release the KALE dataset at https://huggingface.co/datasets/Salesforce/blip3-kale
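
The two-stage approach described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Sample` type, the function names `stage_one` and `stage_two`, and the callables `dense_captioner`, `fuser`, and `train_captioner` are all hypothetical placeholders for the vision-language model, the language model, and the training procedure. The assumption that the specialized VLM receives the alt-text as input alongside the image is also not stated in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Sample:
    image_id: str      # stand-in for the actual image (hypothetical)
    alt_text: str      # factual web-scale alt-text
    caption: str = ""  # knowledge-augmented caption, filled in by the pipeline


def stage_one(
    samples: Iterable[Sample],
    dense_captioner: Callable[[str], str],  # placeholder for a large VLM
    fuser: Callable[[str, str], str],       # placeholder for a language model
) -> List[Sample]:
    """Stage 1: write a dense synthetic caption, then merge it with the alt-text."""
    out = []
    for s in samples:
        dense = dense_captioner(s.image_id)
        out.append(Sample(s.image_id, s.alt_text, fuser(dense, s.alt_text)))
    return out


def stage_two(
    seed: List[Sample],
    unlabeled: Iterable[Sample],
    train_captioner: Callable[[List[Sample]], Callable[[str, str], str]],
) -> List[Sample]:
    """Stage 2: train a specialized captioner on stage-1 output and scale up."""
    # Assumption: the trained captioner takes (image, alt_text) and returns a caption.
    captioner = train_captioner(seed)
    return [
        Sample(s.image_id, s.alt_text, captioner(s.image_id, s.alt_text))
        for s in unlabeled
    ]
```

Passing the models in as plain callables keeps the sketch independent of any particular model or library; in practice each callable would wrap a real model call.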