Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

Abstract

Diffusion models have proven highly effective in image and video generation; however, because they are trained on single-scale data, they still face composition problems when generating images of varying sizes. Adapting large pre-trained diffusion models to higher resolutions demands substantial computational and optimization resources, yet achieving a generation capability comparable to that of low-resolution models remains elusive. This paper proposes a novel self-cascade diffusion model that leverages the rich knowledge gained from a well-trained low-resolution model to adapt rapidly to higher-resolution image and video generation, employing either a tuning-free or a cheap upsampler tuning paradigm. By integrating a sequence of multi-scale upsampler modules, the self-cascade diffusion model can efficiently adapt to a higher resolution while preserving the original composition and generation capabilities. We further propose a pivot-guided noise re-schedule strategy to speed up the inference process and improve local structural details. Compared to full fine-tuning, our approach achieves a 5x training speed-up and requires only an additional 0.002M tuning parameters. Extensive experiments demonstrate that our approach can quickly adapt to higher-resolution image and video synthesis by fine-tuning for just 10k steps, with virtually no additional inference time.
