Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

February 16, 2024
作者: Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xiaodong Cun, Yufei Wang, Siyu Huang, Yong Zhang, Xintao Wang, Qifeng Chen, Ying Shan, Bihan Wen
cs.AI

Abstract

Diffusion models have proven to be highly effective in image and video generation; however, they still face composition challenges when generating images of varying sizes due to single-scale training data. Adapting large pre-trained diffusion models for higher resolution demands substantial computational and optimization resources, yet achieving a generation capability comparable to low-resolution models remains elusive. This paper proposes a novel self-cascade diffusion model that leverages the rich knowledge gained from a well-trained low-resolution model for rapid adaptation to higher-resolution image and video generation, employing either tuning-free or cheap upsampler tuning paradigms. Integrating a sequence of multi-scale upsampler modules, the self-cascade diffusion model can efficiently adapt to a higher resolution, preserving the original composition and generation capabilities. We further propose a pivot-guided noise re-schedule strategy to speed up the inference process and improve local structural details. Compared to full fine-tuning, our approach achieves a 5X training speed-up and requires only an additional 0.002M tuning parameters. Extensive experiments demonstrate that our approach can quickly adapt to higher resolution image and video synthesis by fine-tuning for just 10k steps, with virtually no additional inference time.
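The abstract only summarizes the design at a high level, so the sketch below is a hedged illustration of the general recipe it describes: a frozen, well-trained low-resolution diffusion backbone augmented with a tiny learnable upsampler module, so that only a handful of extra parameters need tuning. The class name CheapUpsampler, the depthwise residual convolution, and the latent channel count are assumptions made for illustration; they are not taken from the paper or its released code.

```python
# Minimal, illustrative sketch (an assumption, not the authors' implementation):
# keep a well-trained low-resolution diffusion backbone frozen and train only a
# tiny upsampler module so the model can operate on higher-resolution latents.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CheapUpsampler(nn.Module):
    """Lightweight learnable upsampler: bilinear resize plus a small residual conv."""
    def __init__(self, channels: int, scale: float = 2.0):
        super().__init__()
        self.scale = scale
        # Depthwise 3x3 conv keeps the added parameter count tiny (tens of weights).
        self.residual = nn.Conv2d(channels, channels, kernel_size=3,
                                  padding=1, groups=channels)
        nn.init.zeros_(self.residual.weight)  # start as pure bilinear upsampling
        nn.init.zeros_(self.residual.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.interpolate(x, scale_factor=self.scale, mode="bilinear",
                          align_corners=False)
        return x + self.residual(x)

# Toy usage: only the upsampler is trainable; a pretrained UNet would stay frozen.
upsampler = CheapUpsampler(channels=4)          # e.g. 4 latent channels
latent = torch.randn(1, 4, 64, 64)              # low-resolution latent feature map
print(upsampler(latent).shape)                  # torch.Size([1, 4, 128, 128])
print(sum(p.numel() for p in upsampler.parameters()))  # ~40 trainable parameters
```

Initializing the residual convolution to zero means the module starts out as plain bilinear upsampling, so the pretrained model's composition and generation behavior are preserved before any tuning, which is consistent with the "tuning-free or cheap upsampler tuning" paradigms the abstract mentions.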