Diffusion-4K: Ultra-High-Resolution Image Synthesis with Latent Diffusion Models

March 24, 2025
Authors: Jinjin Zhang, Qiuyu Huang, Junjie Liu, Xiefan Guo, Di Huang
cs.AI

Abstract

In this paper, we present Diffusion-4K, a novel framework for direct ultra-high-resolution image synthesis using text-to-image diffusion models. The core advancements include: (1) Aesthetic-4K Benchmark: addressing the absence of a publicly available 4K image synthesis dataset, we construct Aesthetic-4K, a comprehensive benchmark for ultra-high-resolution image generation. We curated a high-quality 4K dataset with carefully selected images and captions generated by GPT-4o. Additionally, we introduce GLCM Score and Compression Ratio metrics to evaluate fine details, combined with holistic measures such as FID, Aesthetics and CLIPScore for a comprehensive assessment of ultra-high-resolution images. (2) Wavelet-based Fine-tuning: we propose a wavelet-based fine-tuning approach for direct training with photorealistic 4K images, applicable to various latent diffusion models, demonstrating its effectiveness in synthesizing highly detailed 4K images. Consequently, Diffusion-4K achieves impressive performance in high-quality image synthesis and text prompt adherence, especially when powered by modern large-scale diffusion models (e.g., SD3-2B and Flux-12B). Extensive experimental results from our benchmark demonstrate the superiority of Diffusion-4K in ultra-high-resolution image synthesis.
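
The abstract names GLCM Score and Compression Ratio as fine-detail metrics but does not define them. As a rough illustration only, the sketch below computes a gray-level co-occurrence matrix (GLCM) and one standard texture statistic derived from it (contrast), plus a raw-to-compressed size ratio. The choice of contrast, the 8-level quantization, the single horizontal offset, and the use of zlib in place of an image codec are all assumptions made here, and the function names are hypothetical; the paper's actual definitions may differ.

```python
import zlib

import numpy as np


def glcm(gray, levels=8, dx=1, dy=0):
    """Normalized gray-level co-occurrence matrix for one offset (dx, dy >= 0)."""
    # Quantize 8-bit intensities into `levels` bins (assumed setting).
    q = (gray.astype(np.int64) * levels) // 256
    h, w = q.shape
    # Pair every pixel with its neighbor at the given offset.
    a = q[: h - dy, : w - dx].ravel()
    b = q[dy:, dx:].ravel()
    counts = np.zeros((levels, levels), dtype=np.float64)
    np.add.at(counts, (a, b), 1.0)
    return counts / counts.sum()


def glcm_contrast(p):
    """GLCM contrast: expected squared level difference between neighbors."""
    i, j = np.indices(p.shape)
    return float(np.sum(p * (i - j) ** 2))


def compression_ratio(gray):
    """Raw byte size divided by zlib-compressed size (zlib is a stand-in codec)."""
    raw = np.ascontiguousarray(gray, dtype=np.uint8).tobytes()
    return len(raw) / len(zlib.compress(raw, 9))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    flat = np.zeros((64, 64), dtype=np.uint8)
    noise = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
    for name, img in [("flat", flat), ("noise", noise)]:
        print(name, glcm_contrast(glcm(img)), compression_ratio(img))
```

Under these assumptions, an image with more local texture gives a higher GLCM contrast and compresses less, so its size ratio is lower. That is the kind of signal a detail-oriented metric can add to holistic measures such as FID or CLIPScore.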
