Scaling Pre-training to One Hundred Billion Data for Vision Language Models

Abstract

We present an empirical investigation of the potential of pre-training vision-language models at an unprecedented scale: 100 billion examples. We find that model performance tends to saturate at this scale on many common Western-centric classification and retrieval benchmarks, such as COCO Captions. Nevertheless, cultural-diversity tasks achieve more substantial gains from the 100-billion-scale web data, thanks to its coverage of long-tail concepts. Furthermore, we analyze the model's multilinguality and show gains in low-resource languages as well. In addition, we observe that reducing the size of the pre-training dataset with quality filters such as CLIP-based filtering, which is typically done to improve performance, may inadvertently reduce the cultural diversity represented even in large-scale datasets. Our results highlight that while traditional benchmarks may not benefit significantly from scaling noisy, raw web data to 100 billion examples, this data scale is vital for building truly inclusive multimodal systems.
