Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies

Abstract

This paper investigates the performance of Contrastive Language-Image Pre-training (CLIP) when scaled down to limited computation budgets. We explore CLIP along three dimensions: data, architecture, and training strategies. With regard to data, we demonstrate the significance of high-quality training data and show that a smaller dataset of high-quality data can outperform a larger dataset of lower quality. We also examine how model performance varies with dataset size, suggesting that smaller ViT models are better suited to smaller datasets, while larger models perform better on larger datasets under fixed compute. Additionally, we provide guidance on when to choose a CNN-based or a ViT-based architecture for CLIP training. We compare four CLIP training strategies (SLIP, FLIP, CLIP, and CLIP+Data Augmentation) and show that the choice of training strategy depends on the available compute resources. Our analysis reveals that CLIP+Data Augmentation can achieve performance comparable to CLIP using only half of the training data. This work provides practical insights into how to effectively train and deploy CLIP models, making them more accessible and affordable for practical use in various applications.
