Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies

April 12, 2024
Authors: Zichao Li, Cihang Xie, Ekin Dogus Cubuk
cs.AI

Abstract

This paper investigates the performance of Contrastive Language-Image Pre-training (CLIP) when scaled down to limited computation budgets. We explore CLIP along three dimensions: data, architecture, and training strategies. With regard to data, we demonstrate the significance of high-quality training data and show that a smaller dataset of high-quality data can outperform a larger dataset of lower quality. We also examine how model performance varies with dataset size, finding that smaller ViT models are better suited to smaller datasets, while larger models perform better on larger datasets at fixed compute. Additionally, we provide guidance on when to choose a CNN-based or a ViT-based architecture for CLIP training. We compare four CLIP training strategies (SLIP, FLIP, CLIP, and CLIP+Data Augmentation) and show that the choice of training strategy depends on the available compute resources. Our analysis reveals that CLIP+Data Augmentation can achieve performance comparable to CLIP using only half of the training data. This work provides practical insights into how to train and deploy CLIP models effectively, making them more accessible and affordable for practical use in various applications.
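
For context, the objective being scaled in all four strategies is CLIP's symmetric contrastive loss over paired image and text embeddings. Below is a minimal PyTorch sketch of that loss; the function name, the fixed temperature value, and the random toy embeddings are illustrative assumptions, not code from the paper.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # Both inputs: (N, D) L2-normalized embeddings from the image and text encoders.
    logits = image_features @ text_features.t() / temperature  # (N, N) scaled cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # matching pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random unit-norm embeddings for a batch of 8 pairs.
img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
print(clip_contrastive_loss(img, txt))

In CLIP the temperature is a learnable parameter rather than the fixed constant used here; the CLIP+Data Augmentation variant differs only in applying image augmentations before encoding, leaving this objective unchanged.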
