Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets

Abstract

In studies of transferable learning, scaling laws are derived for various important foundation models to predict their properties and performance at larger scales. We show here how scaling law derivation can also be used to compare models and datasets, making it possible to decide which procedure is preferable for pre-training. For the first time, full scaling laws based on dense measurements across a wide span of model and samples-seen scales are derived for two important language-vision learning procedures, CLIP and MaMMUT, which use a contrastive-only loss and a combined contrastive and captioning (text-generative) loss, respectively. After ensuring sufficient prediction accuracy on held-out points, we use the derived scaling laws to compare both models, obtaining evidence that MaMMUT improves more strongly with scale and is more sample-efficient than standard CLIP. To strengthen the validity of the comparison, we show scaling laws for various downstream tasks (classification, retrieval, and segmentation) and for different open datasets (DataComp, DFN, and Re-LAION), consistently observing the same trends. We also show that the comparison can be performed when scaling laws are derived with a constant learning rate schedule, which reduces compute cost. Accurate derivation of scaling laws thus provides a means to compare models and datasets across scale spans, avoiding misleading conclusions based on measurements from a single reference scale only, and paves the way for systematic comparison and improvement of open foundation models and the datasets used to create them. We release all pre-trained models with their intermediate checkpoints, including openMaMMUT-L/14, which achieves 80.3% zero-shot ImageNet-1k accuracy when trained on 12.8B samples from DataComp-1.4B. Code for reproducing the experiments in the paper and the raw experimental data can be found at https://github.com/LAION-AI/scaling-laws-for-comparison.
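The idea of comparing two learning procedures through fitted scaling laws, rather than at a single reference scale, can be illustrated with a minimal sketch. It assumes a plain power law, error = a * compute^(-b), fitted by least squares in log-log space; the functional form actually used in the paper may differ. All numbers below are synthetic and chosen only for illustration, and the names `fit_power_law`, `predict`, and `crossover_compute` are hypothetical helpers, not part of the released code.

```python
import numpy as np


def fit_power_law(compute, error):
    """Fit error = a * compute**(-b) by linear least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(compute), np.log(error), deg=1)
    return float(np.exp(intercept)), float(-slope)  # (a, b)


def predict(params, compute):
    """Evaluate a fitted power law at the given compute scale(s)."""
    a, b = params
    return a * np.asarray(compute, dtype=float) ** (-b)


def crossover_compute(p1, p2):
    """Compute scale at which two fitted power laws intersect."""
    (a1, b1), (a2, b2) = p1, p2
    return float((a1 / a2) ** (1.0 / (b1 - b2)))


# Synthetic, noise-free measurements (NOT results from the paper).
compute = np.logspace(9, 12, 8)        # hypothetical compute scales
err_a = 3.0 * compute ** -0.15         # hypothetical procedure A
err_b = 6.0 * compute ** -0.18         # hypothetical procedure B

# Fit on all but the largest scale, which is held out to check prediction accuracy.
fit_a = fit_power_law(compute[:-1], err_a[:-1])
fit_b = fit_power_law(compute[:-1], err_b[:-1])
heldout_rel_err_a = abs(predict(fit_a, compute[-1]) - err_a[-1]) / err_a[-1]
heldout_rel_err_b = abs(predict(fit_b, compute[-1]) - err_b[-1]) / err_b[-1]

# Below the crossover, A has the lower error; above it, B does.
c_star = crossover_compute(fit_a, fit_b)
print(f"A: a={fit_a[0]:.3g}, b={fit_a[1]:.3g}; B: a={fit_b[0]:.3g}, b={fit_b[1]:.3g}")
print(f"held-out relative errors: {heldout_rel_err_a:.2e}, {heldout_rel_err_b:.2e}")
print(f"fitted curves cross at compute ~ {c_star:.3g}")
```

In this synthetic setup, procedure A looks better at the smallest scale while procedure B is better at the largest, so a comparison made at one small reference scale alone would point to the wrong choice. Fitting the full curves and checking them on a held-out point makes the change in ranking visible.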

