BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions

Abstract

We introduce BLIP3-KALE, a dataset of 218 million image-text pairs that bridges the gap between descriptive synthetic captions and factual web-scale alt-text. KALE augments synthetic dense image captions with web-scale alt-text to generate factually grounded image captions. Our two-stage approach first uses large vision-language models and language models to create knowledge-augmented captions, and then uses those captions to train a specialized VLM that scales up the dataset. We train vision-language models on KALE and demonstrate improvements on vision-language tasks. Our experiments show the utility of KALE for training more capable and knowledgeable multimodal models. We release the KALE dataset at https://huggingface.co/datasets/Salesforce/blip3-kale
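
The two-stage approach can be pictured with a minimal sketch. Everything in it is hypothetical: the `Sample` fields, the prompt wording, the function names, and the `echo_fuse` stub standing in for a language model are assumptions made for illustration, not the authors' implementation or any library's API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Sample:
    image_id: str
    dense_caption: str  # synthetic, descriptive caption (assumed field name)
    alt_text: str       # factual web alt-text for the same image (assumed field name)


def build_prompt(sample: Sample) -> str:
    """Hypothetical prompt asking a language model to merge both text sources."""
    return (
        "Rewrite the dense caption so that it keeps its visual detail and "
        "adds the facts found in the alt-text.\n"
        f"Dense caption: {sample.dense_caption}\n"
        f"Alt-text: {sample.alt_text}"
    )


def stage_one(samples: Iterable[Sample], fuse: Callable[[str], str]) -> List[dict]:
    """Stage 1 (sketch): produce one knowledge-augmented caption per sample."""
    return [
        {"image_id": s.image_id, "caption": fuse(build_prompt(s))}
        for s in samples
    ]


def echo_fuse(prompt: str) -> str:
    """Stand-in for a real model: joins the two text fields found in the prompt."""
    lines = prompt.splitlines()
    dense = lines[1].split(": ", 1)[1]
    alt = lines[2].split(": ", 1)[1]
    return f"{dense} ({alt})"


pairs = stage_one(
    [Sample("img-0", "A tall iron lattice tower at dusk.", "Eiffel Tower, Paris")],
    fuse=echo_fuse,
)
# Stage 2 (not shown): pairs like these would serve as training data for a
# specialized captioning VLM, which would then caption the remaining images.
```

Passing the model in as a plain callable keeps the sketch independent of any particular model or serving stack; a real pipeline would replace `echo_fuse` with calls to an actual language model.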
