Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models
May 27, 2024
Authors: Cristina N. Vasconcelos, Abdullah Rashwan, Austin Waters, Trevor Walker, Keyang Xu, Jimmy Yan, Rui Qian, Shixin Luo, Zarana Parekh, Andrew Bunner, Hongliang Fei, Roopal Garg, Mandy Guo, Ivana Kajic, Yeqing Li, Henna Nandwani, Jordi Pont-Tuset, Yasumasa Onoe, Sarah Rosston, Su Wang, Wenlei Zhou, Kevin Swersky, David J. Fleet, Jason M. Baldridge, Oliver Wang
cs.AI
Abstract
We address the long-standing problem of how to learn effective pixel-based image diffusion models at scale, introducing a remarkably simple greedy growing method for stable training of large-scale, high-resolution models without the need for cascaded super-resolution components. The key insight stems from careful pre-training of core components, namely those responsible for text-to-image alignment vs. high-resolution rendering. We first demonstrate the benefits of scaling a Shallow UNet, with no down(up)-sampling enc(dec)oder. Scaling its deep core layers is shown to improve alignment, object structure, and composition. Building on this core model, we propose a greedy algorithm that grows the architecture into high-resolution end-to-end models, while preserving the integrity of the pre-trained representation, stabilizing training, and reducing the need for large high-resolution datasets. This enables a single-stage model capable of generating high-resolution images without the need for a super-resolution cascade. Our key results rely on public datasets and show that we are able to train non-cascaded models of up to 8B parameters with no further regularization schemes. Vermeer, our full-pipeline model trained on internal datasets to produce 1024x1024 images without cascades, is preferred by human evaluators over SDXL by 44.0% vs. 21.4%.
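The growing procedure the abstract describes can be pictured as follows. This is only an illustrative sketch of the idea, not the paper's implementation: the `Stage` class and `grow` function are hypothetical names, and the resolution schedule shown is an assumption. The point it demonstrates is structural: a pretrained shallow core is wrapped, one stage at a time, with new outer down/up-sampling layers, so the core's pretrained representation is left intact.

```python
# Illustrative sketch of greedy growing (names and schedule are assumptions,
# not taken from the paper). A pretrained low-resolution core with no
# down/up-sampling is wrapped stage by stage toward the target resolution.

class Stage:
    """A placeholder for one UNet stage (no real weights)."""
    def __init__(self, name, pretrained=False):
        self.name = name
        self.pretrained = pretrained  # True if weights come from a prior phase

def grow(model, new_resolution):
    """Wrap the existing stack with one new outer down/up-sampling stage pair.

    New stages are placed OUTSIDE the existing (pretrained) stack, so the
    inner representation is preserved rather than re-initialized.
    """
    down = Stage(f"down_{new_resolution}")
    up = Stage(f"up_{new_resolution}")
    return [down] + model + [up]

# Phase 1: pretrain a shallow core at low resolution (no down/up-sampling).
model = [Stage("core_block", pretrained=True)]

# Phases 2..n: greedily grow toward the target resolution, outermost last.
for res in (128, 256, 512, 1024):
    model = grow(model, res)

print([s.name for s in model])
# The outermost stages now operate at the highest resolution, with the
# pretrained core still at the center of the stack.
```

The design point this mirrors is that only the newly added outer stages need high-resolution data and training, which is why the method reduces the demand for large high-resolution datasets.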