On the Scalability of Diffusion-based Text-to-Image Generation
April 3, 2024
Authors: Hao Li, Yang Zou, Ying Wang, Orchid Majumder, Yusheng Xie, R. Manmatha, Ashwin Swaminathan, Zhuowen Tu, Stefano Ermon, Stefano Soatto
cs.AI
Abstract
Scaling up model and data size has been quite successful for the evolution of
LLMs. However, the scaling law for diffusion-based text-to-image (T2I)
models is not fully explored. It is also unclear how to efficiently scale the
model for better performance at reduced cost. The different training settings
and expensive training cost make a fair model comparison extremely difficult.
In this work, we empirically study the scaling properties of diffusion-based
T2I models by performing extensive and rigorous ablations on scaling both
the denoising backbone and the training set, including training scaled UNet and
Transformer variants ranging from 0.4B to 4B parameters on datasets of up to 600M
images. For model scaling, we find that the location and amount of cross-attention
distinguish the performance of existing UNet designs. Increasing the number of
Transformer blocks is more parameter-efficient for improving text-image
alignment than increasing channel numbers. We then identify an efficient UNet
variant, which is 45% smaller and 28% faster than SDXL's UNet. On the data
scaling side, we show that the quality and diversity of the training set matter
more than dataset size alone. Increasing caption density and diversity
improves text-image alignment performance and learning efficiency. Finally,
we provide scaling functions to predict the text-image alignment performance as
functions of the scale of model size, compute, and dataset size.
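As a minimal sketch of what fitting such a scaling function might look like (the saturating power-law form, the variable names, and the data points below are illustrative assumptions, not the paper's actual fit or measurements):

```python
# Illustrative sketch: fitting a hypothetical scaling function that predicts
# text-image alignment from training compute. The functional form, units, and
# data points are assumptions for illustration only.
import numpy as np
from scipy.optimize import curve_fit

def scaling_fn(compute, a, b, c):
    """Saturating power law: the residual error decays as compute grows."""
    return a * compute ** (-b) + c

# Hypothetical observations: compute (in units of 1e20 FLOPs) vs. an
# alignment score in [0, 1].
compute = np.array([1.0, 3.0, 10.0, 30.0, 100.0])
score = np.array([0.200, 0.236, 0.260, 0.274, 0.284])

# Fit the residual (1 - score), which should decay with compute.
(a, b, c), _ = curve_fit(scaling_fn, compute, 1.0 - score, p0=[0.1, 0.4, 0.7])
print(f"fitted exponent b = {b:.3f}")

# Extrapolate the predicted alignment score at a 10x larger budget.
print(f"predicted score at compute=1000: {1.0 - scaling_fn(1000.0, a, b, c):.3f}")
```

The same fitting recipe extends to model size or dataset size as the independent variable.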