Scaling Laws of Synthetic Images for Model Training ... for Now

Abstract

Recent significant advances in text-to-image models unlock the possibility of training vision systems using synthetic images, potentially overcoming the difficulty of collecting curated data at scale. It is unclear, however, how these models behave at scale, as more synthetic data is added to the training set. In this paper we study the scaling laws of synthetic images generated by state-of-the-art text-to-image models, for the training of supervised models: image classifiers with label supervision, and CLIP with language supervision. We identify several factors, including text prompts, classifier-free guidance scale, and types of text-to-image models, that significantly affect scaling behavior. After tuning these factors, we observe that synthetic images demonstrate a scaling trend similar to, but slightly less effective than, real images in CLIP training, while they significantly underperform in scaling when training supervised image classifiers. Our analysis indicates that the main reason for this underperformance is the inability of off-the-shelf text-to-image models to generate certain concepts, a limitation that significantly impairs the training of image classifiers. Our findings also suggest that scaling synthetic data can be particularly effective in scenarios such as: (1) when there is a limited supply of real images for a supervised problem (e.g., fewer than 0.5 million images in ImageNet), (2) when the evaluation dataset diverges significantly from the training data, i.e., an out-of-distribution scenario, or (3) when synthetic data is used in conjunction with real images, as demonstrated in the training of CLIP models.
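As a rough illustration of what measuring a scaling trend can involve, the sketch below fits a power law of the form loss = a * N^(-b) to (dataset size, loss) pairs by least squares in log-log space. The abstract does not state the functional form used, so the power-law form is an assumption here, and the function name `fit_power_law` and all data points are hypothetical values made up for the example, not results from the paper.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-b) by linear least squares on log-log values.

    Hypothetical helper for illustration; assumes all inputs are positive.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    # Ordinary least-squares slope and intercept of y against x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log(loss) = log(a) - b * log(size), so a = exp(intercept), b = -slope.
    return math.exp(intercept), -slope


# Made-up points generated from loss = 5 * N**-0.2 (NOT data from the paper).
sizes = [1e6, 2e6, 4e6, 8e6, 16e6]
losses = [5.0 * n ** -0.2 for n in sizes]

a, b = fit_power_law(sizes, losses)
print(f"a = {a:.3f}, b = {b:.3f}")
```

Under this assumed form, a larger exponent b means the loss drops faster as images are added, so fitting real and synthetic training runs separately and comparing their exponents is one way to express "similar to, but slightly less effective than" numerically.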
