Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models
May 27, 2024
Authors: Cristina N. Vasconcelos, Abdullah Rashwan, Austin Waters, Trevor Walker, Keyang Xu, Jimmy Yan, Rui Qian, Shixin Luo, Zarana Parekh, Andrew Bunner, Hongliang Fei, Roopal Garg, Mandy Guo, Ivana Kajic, Yeqing Li, Henna Nandwani, Jordi Pont-Tuset, Yasumasa Onoe, Sarah Rosston, Su Wang, Wenlei Zhou, Kevin Swersky, David J. Fleet, Jason M. Baldridge, Oliver Wang
cs.AI
Abstract
We address the long-standing problem of how to learn effective pixel-based image diffusion models at scale, introducing a remarkably simple greedy growing method for stable training of large-scale, high-resolution models without the need for cascaded super-resolution components. The key insight stems from careful pre-training of core components, namely, those responsible for text-to-image alignment vs. high-resolution rendering. We first demonstrate the benefits of scaling a Shallow UNet, with no down(up)-sampling enc(dec)oder. Scaling its deep core layers is shown to improve alignment, object structure, and composition. Building on this core model, we propose a greedy algorithm that grows the architecture into high-resolution end-to-end models, while preserving the integrity of the pre-trained representation, stabilizing training, and reducing the need for large high-resolution datasets. This enables a single-stage model capable of generating high-resolution images without the need for a super-resolution cascade. Our key results rely on public datasets and show that we are able to train non-cascaded models of up to 8B parameters with no further regularization schemes. Vermeer, our full-pipeline model trained with internal datasets to produce 1024x1024 images without cascades, is preferred by 44.0% of human evaluators vs. 21.4% for SDXL.
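The growing idea described in the abstract can be illustrated with a toy sketch: a pre-trained, resolution-preserving core (standing in for the Shallow UNet's deep layers) is wrapped with newly added down/up-sampling stages so the combined model operates end-to-end at a higher resolution while the core's weights are kept intact. All class and function names below are illustrative assumptions, not the paper's actual architecture or code.

```python
import numpy as np

class ShallowCore:
    """Toy stand-in for a pre-trained core with no down/up-sampling:
    a residual block that preserves spatial resolution."""
    def __init__(self, channels=4, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.standard_normal((channels, channels)) * 0.1  # "pre-trained" weights

    def __call__(self, x):            # x: (H, W, C)
        return x + x @ self.w         # residual map, same resolution out

def downsample(x):
    """2x average pooling (a newly grown encoder stage)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def upsample(x):
    """2x nearest-neighbour upsampling (a newly grown decoder stage)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

class GrownModel:
    """Grown model: new sampling stages wrapped around the frozen core,
    so its pre-trained representation is preserved."""
    def __init__(self, core):
        self.core = core              # core weights are reused, not re-initialized

    def __call__(self, x):            # x at the new, higher resolution
        return upsample(self.core(downsample(x)))

core = ShallowCore()
low = np.zeros((64, 64, 4))
assert core(low).shape == (64, 64, 4)        # core runs at base resolution

grown = GrownModel(core)
high = np.zeros((128, 128, 4))
assert grown(high).shape == (128, 128, 4)    # grown model runs at 2x resolution
```

In this sketch only the wrapper stages are new; repeating the wrapping step greedily would extend the model to successively higher resolutions, which mirrors the paper's stated goal of reaching high resolution in a single end-to-end model rather than a super-resolution cascade.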