RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm

Abstract

After pre-training on extensive image-text pairs, Contrastive Language-Image Pre-training (CLIP) demonstrates promising performance on a wide variety of benchmarks. However, a substantial volume of non-paired data, such as multimodal interleaved documents, remains underutilized for vision-language representation learning. To fully leverage these unpaired documents, we first establish a Real-World Data Extraction pipeline to extract high-quality images and texts. We then design a hierarchical retrieval method to efficiently associate each image with multiple semantically relevant realistic texts. To further enhance fine-grained visual information, we propose an image semantic augmented generation module for synthetic text production. In addition, we employ a semantic balance sampling strategy to improve dataset diversity, enabling better learning of long-tail concepts. Based on these innovations, we construct RealSyn, a dataset combining realistic and synthetic texts, available in three scales: 15M, 30M, and 100M. Extensive experiments demonstrate that RealSyn effectively advances vision-language representation learning and exhibits strong scalability. Models pre-trained on RealSyn achieve state-of-the-art performance on multiple downstream tasks. To facilitate future research, the RealSyn dataset and pre-trained model weights are released at https://github.com/deepglint/RealSyn.
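
The abstract does not spell out how the hierarchical retrieval works, so the following is only a minimal sketch of one plausible coarse-to-fine reading: texts are grouped into clusters ahead of time, each image is first matched to its nearest cluster centroid, and only the texts inside that cluster are ranked. The function name `hierarchical_retrieve`, the toy 2-D vectors, and the assumption that images and texts share one embedding space are all hypothetical illustrations, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hierarchical_retrieve(image_vec, centroids, clusters, top_k=2):
    """Return the top_k texts from the cluster whose centroid best matches the image.

    centroids: one vector per cluster (assumed precomputed, e.g. by k-means)
    clusters:  list of lists of (text, vector) pairs, aligned with centroids
    """
    # Stage 1 (coarse): pick the cluster whose centroid is most similar to the image.
    best = max(range(len(centroids)), key=lambda i: cosine(image_vec, centroids[i]))
    # Stage 2 (fine): rank only the texts inside that cluster, not the whole corpus.
    ranked = sorted(clusters[best], key=lambda tv: cosine(image_vec, tv[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]


# Toy data: hypothetical 2-D embeddings in a shared image-text space.
centroids = [[1.0, 0.0], [0.0, 1.0]]
clusters = [
    [("a dog on grass", [0.9, 0.1]),
     ("a puppy running", [0.8, 0.3]),
     ("a cat indoors", [0.6, 0.5])],
    [("stock market chart", [0.1, 0.9]),
     ("quarterly earnings", [0.2, 0.8])],
]
image_vec = [0.95, 0.05]
result = hierarchical_retrieve(image_vec, centroids, clusters, top_k=2)
print(result)
```

The point of the two stages is cost: with C clusters of roughly N/C texts each, an image is compared against about C + N/C vectors instead of all N, which is what makes associating every image with several texts feasible at scale.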
