

Scaling Laws of Synthetic Images for Model Training ... for Now

December 7, 2023
Authors: Lijie Fan, Kaifeng Chen, Dilip Krishnan, Dina Katabi, Phillip Isola, Yonglong Tian
cs.AI

Abstract

Recent significant advances in text-to-image models unlock the possibility of training vision systems using synthetic images, potentially overcoming the difficulty of collecting curated data at scale. It is unclear, however, how these models behave at scale, as more synthetic data is added to the training set. In this paper, we study the scaling laws of synthetic images generated by state-of-the-art text-to-image models, for the training of supervised models: image classifiers with label supervision, and CLIP with language supervision. We identify several factors, including text prompts, classifier-free guidance scale, and types of text-to-image models, that significantly affect scaling behavior. After tuning these factors, we observe that synthetic images demonstrate a scaling trend similar to, but slightly less effective than, real images in CLIP training, while they significantly underperform in scaling when training supervised image classifiers. Our analysis indicates that the main reason for this underperformance is the inability of off-the-shelf text-to-image models to generate certain concepts, a limitation that significantly impairs the training of image classifiers. Our findings also suggest that scaling synthetic data can be particularly effective in scenarios such as: (1) when there is a limited supply of real images for a supervised problem (e.g., fewer than 0.5 million images in ImageNet), (2) when the evaluation dataset diverges significantly from the training data, indicating an out-of-distribution scenario, or (3) when synthetic data is used in conjunction with real images, as demonstrated in the training of CLIP models.
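Scaling trends of the kind the abstract describes are commonly summarized by fitting a power law, error ≈ a · N^(−b), to downstream error as a function of training-set size N, and comparing the fitted exponent b across data sources (e.g., synthetic vs. real images). The sketch below is illustrative only, assuming hypothetical data points and a helper `fit_power_law` that is not part of the paper's code; it fits the power law by ordinary least squares in log-log space.

```python
import math

def fit_power_law(sizes, errors):
    """Fit error ~ a * N**(-b) by least squares in log-log space.

    Taking logs gives log(e) = log(a) - b * log(N), a linear model,
    so a one-variable linear regression recovers (a, b).
    Returns the prefactor a and the scaling exponent b.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    k = len(xs)
    mx = sum(xs) / k
    my = sum(ys) / k
    slope = (
        sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        / sum((x - mx) ** 2 for x in xs)
    )
    intercept = my - slope * mx
    return math.exp(intercept), -slope

# Hypothetical (training-set size, top-1 error) measurements,
# e.g. classifiers trained on increasing amounts of synthetic data.
sizes = [1e5, 5e5, 1e6, 5e6]
errors = [0.60, 0.48, 0.43, 0.34]
a, b = fit_power_law(sizes, errors)
```

A larger fitted exponent b means error falls faster as data is added; the paper's observation that synthetic images scale "similar to, but slightly less effective than" real images for CLIP corresponds to a somewhat smaller exponent (or larger irreducible error) for the synthetic-data curve.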