Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies
April 12, 2024
作者: Zichao Li, Cihang Xie, Ekin Dogus Cubuk
cs.AI
Abstract
This paper investigates the performance of Contrastive Language-Image
Pre-training (CLIP) when scaled down to limited computation budgets. We explore
CLIP along three dimensions: data, architecture, and training strategies. With
regard to data, we demonstrate the significance of high-quality training data
and show that a smaller dataset of high-quality data can outperform a larger
dataset of lower quality. We also examine how model performance varies with
dataset size, finding that smaller ViT models are better suited to smaller
datasets, while larger models perform better on larger datasets at fixed
compute. Additionally, we provide guidance on when to choose a CNN-based or a
ViT-based architecture for CLIP training. We compare four CLIP training
strategies - SLIP, FLIP, CLIP, and CLIP+Data Augmentation - and show that the
choice of training strategy depends on the available compute resources. Our
analysis reveals that CLIP+Data Augmentation can achieve performance comparable
to CLIP while using only half of the training data. This work provides
practical insights into how to train and deploy CLIP models effectively, making
them more accessible and affordable for practical use in various applications.
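For readers unfamiliar with the objective the abstract refers to, the sketch below illustrates the symmetric contrastive (InfoNCE) loss that CLIP trains with; in the CLIP+Data Augmentation variant described above, images would additionally pass through stochastic transforms (e.g. random crops and flips) before the image encoder. This is a minimal PyTorch illustration, not the authors' code: the batch size, embedding width, and temperature value are arbitrary placeholders, and the encoders are replaced by random tensors.

# Minimal sketch of the symmetric contrastive (InfoNCE) objective used by
# CLIP-style training. Illustrative only; hyperparameters and encoder
# outputs here are placeholders, not the paper's configuration.
import torch
import torch.nn.functional as F

def clip_loss(image_features: torch.Tensor,
              text_features: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy over image-text similarity logits."""
    # L2-normalize so dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    # Pairwise similarity matrix scaled by temperature: (batch, batch).
    logits = image_features @ text_features.t() / temperature
    # Matched image-text pairs sit on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average the image-to-text and text-to-image cross-entropy losses.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

if __name__ == "__main__":
    # Toy usage: random embeddings standing in for encoder outputs
    # (e.g. ViT image features and transformer text features).
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(clip_loss(img, txt).item())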