Diffusion-4K: Ultra-High-Resolution Image Synthesis with Latent Diffusion Models

Abstract

In this paper, we present Diffusion-4K, a novel framework for direct ultra-high-resolution image synthesis using text-to-image diffusion models. The core advancements are: (1) Aesthetic-4K Benchmark: to address the absence of a publicly available 4K image synthesis dataset, we construct Aesthetic-4K, a comprehensive benchmark for ultra-high-resolution image generation. We curate a high-quality 4K dataset of carefully selected images with captions generated by GPT-4o. In addition, we introduce the GLCM Score and Compression Ratio metrics to evaluate fine details, and combine them with holistic measures such as FID, Aesthetics, and CLIPScore for a comprehensive assessment of ultra-high-resolution images. (2) Wavelet-based Fine-tuning: we propose a wavelet-based fine-tuning approach for direct training on photorealistic 4K images. The approach is applicable to various latent diffusion models, and we demonstrate its effectiveness in synthesizing highly detailed 4K images. Consequently, Diffusion-4K achieves impressive performance in high-quality image synthesis and text prompt adherence, especially when powered by modern large-scale diffusion models (e.g., SD3-2B and Flux-12B). Extensive experimental results on our benchmark demonstrate the superiority of Diffusion-4K in ultra-high-resolution image synthesis.
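
To give a concrete sense of what detail-oriented metrics of this kind measure, the following is a minimal sketch and not the definitions used in this work, which the abstract does not specify. It assumes a grayscale image stored as a list of rows of 8-bit integers, uses the entropy of a horizontal gray-level co-occurrence matrix (GLCM) as one common texture statistic, and uses zlib as a stand-in codec for the compression ratio. The function names `glcm_entropy` and `compression_ratio`, the 8-level quantization, and the choice of pixel offset are all illustrative assumptions.

```python
import math
import random
import zlib
from collections import Counter


def glcm_entropy(img, levels=8, max_value=255):
    """Entropy in bits of the horizontal-neighbour co-occurrence matrix.

    Assumption: entropy is only one of several statistics that can be
    derived from a GLCM; the paper's GLCM Score may be defined differently.
    """
    # Quantise every pixel into `levels` gray levels.
    quantised = [
        [min(p * levels // (max_value + 1), levels - 1) for p in row]
        for row in img
    ]
    # Count (left pixel, right pixel) pairs, i.e. a pixel offset of (0, 1).
    pairs = Counter()
    for row in quantised:
        for left, right in zip(row, row[1:]):
            pairs[(left, right)] += 1
    total = sum(pairs.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in pairs.values())


def compression_ratio(img):
    """Raw size divided by compressed size.

    Assumption: zlib stands in for whatever image codec an actual
    benchmark would use.
    """
    raw = bytes(p for row in img for p in row)
    return len(raw) / len(zlib.compress(raw, 9))


if __name__ == "__main__":
    rng = random.Random(0)
    flat = [[128] * 64 for _ in range(64)]
    noisy = [[rng.randrange(256) for _ in range(64)] for _ in range(64)]
    for name, image in (("flat", flat), ("noisy", noisy)):
        print(name, glcm_entropy(image), compression_ratio(image))
```

In this sketch a featureless image yields zero co-occurrence entropy and compresses very well, while a highly textured image yields higher entropy and compresses poorly. That contrast is why statistics of this kind can complement holistic measures such as FID when the goal is to assess fine detail.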