Scaling Laws of Synthetic Images for Model Training ... for Now
December 7, 2023
Authors: Lijie Fan, Kaifeng Chen, Dilip Krishnan, Dina Katabi, Phillip Isola, Yonglong Tian
cs.AI
Abstract
Recent significant advances in text-to-image models unlock the possibility of
training vision systems using synthetic images, potentially overcoming the
difficulty of collecting curated data at scale. It is unclear, however, how
these models behave at scale, as more synthetic data is added to the training
set. In this paper we study the scaling laws of synthetic images generated by
state-of-the-art text-to-image models for the training of supervised models:
image classifiers with label supervision, and CLIP with language supervision.
We identify several factors, including text prompts, classifier-free guidance
scale, and types of text-to-image models, that significantly affect scaling
behavior. After tuning these factors, we observe that synthetic images
demonstrate a scaling trend similar to, but slightly less effective than, real
images in CLIP training, while they significantly underperform in scaling when
training supervised image classifiers. Our analysis indicates that the main
reason for this underperformance is the inability of off-the-shelf
text-to-image models to generate certain concepts, a limitation that
significantly impairs the training of image classifiers. Our findings also
suggest that scaling synthetic data can be particularly effective in scenarios
such as: (1) when there is a limited supply of real images for a supervised
problem (e.g., fewer than 0.5 million images in ImageNet), (2) when the
evaluation dataset diverges significantly from the training data, indicating
an out-of-distribution scenario, or (3) when synthetic data is used in
conjunction with real images, as demonstrated in the training of CLIP models.
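For a concrete picture of the setup the abstract describes, the sketch below illustrates the two knobs it highlights, class-name text prompts and the classifier-free guidance scale, followed by a simple power-law fit of downstream error versus synthetic dataset size. It is a minimal sketch under assumptions (the diffusers library, the stabilityai/stable-diffusion-2-1 checkpoint, and placeholder class names and accuracy numbers), not the authors' pipeline.

```python
# Minimal sketch, not the paper's code: generate class-conditioned synthetic
# images with an off-the-shelf text-to-image model at a chosen
# classifier-free guidance scale, then fit a power law to downstream error.
import numpy as np
import torch
from diffusers import StableDiffusionPipeline

class_names = ["goldfish", "tabby cat", "fire engine"]  # hypothetical label set

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",  # assumed checkpoint, any SD model works
    torch_dtype=torch.float16,
).to("cuda")

guidance_scale = 2.0  # one of the factors the paper finds affects scaling behavior
synthetic_images = []
for name in class_names:
    out = pipe(
        prompt=f"a photo of a {name}",   # class-name-based text prompt
        guidance_scale=guidance_scale,
        num_images_per_prompt=4,
    )
    synthetic_images.extend(out.images)  # PIL images forming a synthetic training set

# Fit a power law, error ~ a * N^(-b), to classification error vs. dataset
# size; the numbers below are placeholders, not results from the paper.
sizes = np.array([1e5, 5e5, 1e6, 4e6])
errors = np.array([0.42, 0.35, 0.31, 0.27])
slope, log_a = np.polyfit(np.log(sizes), np.log(errors), deg=1)
print(f"fitted scaling exponent b = {-slope:.3f}")
```

Comparing the fitted exponent for synthetic versus real training sets is one way to make the "similar but slightly less effective scaling" claim in the abstract quantitative.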