

Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

February 16, 2024
作者: Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xiaodong Cun, Yufei Wang, Siyu Huang, Yong Zhang, Xintao Wang, Qifeng Chen, Ying Shan, Bihan Wen
cs.AI

Abstract

Diffusion models have proven to be highly effective in image and video generation; however, they still face composition challenges when generating images of varying sizes due to single-scale training data. Adapting large pre-trained diffusion models for higher resolution demands substantial computational and optimization resources, yet achieving a generation capability comparable to low-resolution models remains elusive. This paper proposes a novel self-cascade diffusion model that leverages the rich knowledge gained from a well-trained low-resolution model for rapid adaptation to higher-resolution image and video generation, employing either tuning-free or cheap upsampler tuning paradigms. Integrating a sequence of multi-scale upsampler modules, the self-cascade diffusion model can efficiently adapt to a higher resolution, preserving the original composition and generation capabilities. We further propose a pivot-guided noise re-schedule strategy to speed up the inference process and improve local structural details. Compared to full fine-tuning, our approach achieves a 5X training speed-up and requires only an additional 0.002M tuning parameters. Extensive experiments demonstrate that our approach can quickly adapt to higher resolution image and video synthesis by fine-tuning for just 10k steps, with virtually no additional inference time.
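The abstract's pivot-guided noise re-schedule starts higher-resolution sampling from a re-noised, upsampled low-resolution "pivot" sample rather than from pure noise, which shortens inference. A minimal sketch of that idea, assuming a standard DDPM forward process q(x_t | x_0) and naive nearest-neighbour upsampling (the function names and the simplified schedule here are illustrative, not the paper's implementation):

```python
import numpy as np


def upsample_nearest(x, scale):
    # Naive nearest-neighbour upsampling of an (H, W) array.
    return np.repeat(np.repeat(x, scale, axis=0), scale, axis=1)


def pivot_guided_init(low_res_sample, scale, alpha_bar_t, rng):
    """Re-noise an upsampled low-resolution pivot to an intermediate
    timestep t via the DDPM forward process:
        x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
    Higher-resolution denoising then starts from x_t instead of pure
    noise, so only the remaining t steps need to be run."""
    x0 = upsample_nearest(low_res_sample, scale)
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps


# Example: lift a 32x32 sample to 64x64 and re-noise it to an
# intermediate timestep with cumulative alpha of 0.5.
rng = np.random.default_rng(0)
low_res = rng.standard_normal((32, 32))
x_t = pivot_guided_init(low_res, scale=2, alpha_bar_t=0.5, rng=rng)
```

The choice of the intermediate timestep trades off fidelity to the low-resolution pivot (large alpha_bar_t) against the higher-resolution model's freedom to add local structural detail (small alpha_bar_t).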

