Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models

Abstract

This paper improves diffusion models by integrating a novel multi-resolution network and time-dependent layer normalization. Diffusion models have gained prominence for their effectiveness in high-fidelity image generation. Conventional approaches rely on convolutional U-Net architectures, but recent Transformer-based designs have demonstrated superior performance and scalability. However, Transformer architectures tokenize the input (via "patchification") and therefore face a trade-off between visual fidelity and computational complexity, because the cost of self-attention grows quadratically with the number of tokens. Larger patch sizes yield fewer tokens and make attention cheaper to compute, but they struggle to capture fine-grained visual details, which leads to image distortions. To address this challenge, we propose augmenting the Diffusion model with the Multi-Resolution network (DiMR), a framework that refines features across multiple resolutions and progressively enhances detail from low to high resolution. Additionally, we introduce Time-Dependent Layer Normalization (TD-LN), a parameter-efficient approach that incorporates time-dependent parameters into layer normalization to inject time information and achieve superior performance. We demonstrate the method's efficacy on the class-conditional ImageNet generation benchmark, where DiMR-XL variants outperform prior diffusion models and set new state-of-the-art FID scores of 1.70 on ImageNet 256 × 256 and 2.89 on ImageNet 512 × 512. Project page: https://qihao067.github.io/projects/DiMR
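
The patch-size trade-off described in the abstract can be illustrated with simple arithmetic. The sketch below is not taken from the paper or its code: the 64 x 64 input grid, the patch sizes, and the helper names `num_tokens` and `attention_pairs` are assumptions chosen only for illustration. It counts the tokens produced by non-overlapping patches and the number of token pairs in one full self-attention map, which is the quantity that grows quadratically.

```python
def num_tokens(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch tokens for a height x width input."""
    if height % patch or width % patch:
        raise ValueError("patch size must divide both height and width")
    return (height // patch) * (width // patch)


def attention_pairs(tokens: int) -> int:
    """Token-pair interactions in one full self-attention map (tokens squared)."""
    return tokens * tokens


if __name__ == "__main__":
    # Hypothetical 64 x 64 input grid; the sizes are illustrative assumptions.
    for patch in (2, 4, 8):
        n = num_tokens(64, 64, patch)
        print(f"patch={patch}: tokens={n}, pairs={attention_pairs(n)}")
```

Under these assumptions, halving the patch size quadruples the token count and multiplies the number of attention pairs by 16, which is why small patches preserve detail at a steep computational cost.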
