Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets
June 5, 2025
Authors: Marianna Nezhurina, Tomer Porian, Giovanni Puccetti, Tommie Kerssies, Romain Beaumont, Mehdi Cherti, Jenia Jitsev
cs.AI
Abstract
In studies of transfer learning, scaling laws are derived for various important foundation models to predict their properties and performance at larger scales. We show here how scaling law derivation can also be used for model and dataset comparison, allowing one to decide which pre-training procedure is to be preferred. For the first time, full scaling laws based on dense measurements across a wide span of model scales and samples seen are derived for two important language-vision learning procedures, CLIP and MaMMUT, which use either a contrastive-only loss or a combination of contrastive and captioning (text-generative) losses. Ensuring sufficient prediction accuracy on held-out points, we use the derived scaling laws to compare both models, obtaining evidence for MaMMUT's stronger improvement with scale and better sample efficiency than standard CLIP. To strengthen the validity of the comparison, we show scaling laws for various downstream tasks (classification, retrieval, and segmentation) and for different open datasets (DataComp, DFN, and Re-LAION), observing consistently the same trends. We show that the comparison can also be performed when deriving scaling laws with a constant learning rate schedule, which reduces compute cost. Accurate derivation of scaling laws thus provides the means to compare models and datasets across spans of scales, avoiding misleading conclusions based on measurements at single reference scales only, and paving the way for systematic comparison and improvement of open foundation models and of the datasets used to create them. We release all pre-trained models with their intermediate checkpoints, including openMaMMUT-L/14, which achieves 80.3% zero-shot ImageNet-1k accuracy when trained on 12.8B samples from DataComp-1.4B. Code for reproducing the experiments in the paper, along with the raw experimental data, can be found at https://github.com/LAION-AI/scaling-laws-for-comparison.
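
The comparison technique described above can be illustrated with a minimal sketch (not the authors' code; the data points, constants, and specific functional form below are illustrative assumptions): downstream error as a function of samples seen D is fitted per model with a saturating power law E(D) = a * D^(-b) + e, and the fitted laws are then extrapolated to a larger target scale to compare the two pre-training procedures.

```python
# Minimal sketch of scaling-law fitting and extrapolation for model comparison.
# All measurements below are hypothetical placeholders, not results from the paper.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(D, a, b, e):
    # Saturating power law: error decays as D^-b toward irreducible error e.
    return a * np.power(D, -b) + e

# Hypothetical measurements: downstream zero-shot error at several samples-seen scales.
D = np.array([1.28e9, 3.0e9, 12.8e9, 34.0e9])
err = {
    "CLIP":   np.array([0.40, 0.35, 0.30, 0.27]),
    "MaMMUT": np.array([0.41, 0.34, 0.28, 0.24]),
}

fits = {}
for name, y in err.items():
    # Rough initial guess; bounds keep all parameters non-negative and plausible.
    popt, _ = curve_fit(scaling_law, D, y, p0=(1.0, 0.1, 0.1),
                        bounds=([0.0, 0.0, 0.0], [np.inf, 1.0, 1.0]))
    fits[name] = popt

# Extrapolate both fitted laws to a larger target scale and compare predictions.
D_target = 1.0e11
for name, (a, b, e) in fits.items():
    pred = scaling_law(D_target, a, b, e)
    print(f"{name}: a={a:.3f}, b={b:.3f}, e={e:.3f}, "
          f"predicted error at {D_target:.0e} samples = {pred:.3f}")
```

Under a fit of this form, a larger exponent b means error falls faster as scale grows, which is the sense in which the abstract reports MaMMUT's stronger improvement with scale relative to CLIP.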