Diffusion-4K: Ultra-High-Resolution Image Synthesis with Latent Diffusion Models

Abstract

In this paper, we present Diffusion-4K, a novel framework for direct ultra-high-resolution image synthesis using text-to-image diffusion models. The core advancements are: (1) Aesthetic-4K Benchmark: to address the absence of a publicly available 4K image synthesis dataset, we construct Aesthetic-4K, a comprehensive benchmark for ultra-high-resolution image generation. It comprises a high-quality 4K dataset of carefully selected images paired with captions generated by GPT-4o. We also introduce the GLCM Score and Compression Ratio metrics to evaluate fine details, and combine them with holistic measures such as FID, Aesthetics, and CLIPScore for a comprehensive assessment of ultra-high-resolution images. (2) Wavelet-based Fine-tuning: we propose a wavelet-based fine-tuning approach for direct training on photorealistic 4K images. The approach is applicable to various latent diffusion models, and we demonstrate its effectiveness in synthesizing highly detailed 4K images. Consequently, Diffusion-4K achieves impressive performance in high-quality image synthesis and text prompt adherence, especially when powered by modern large-scale diffusion models (e.g., SD3-2B and Flux-12B). Extensive experimental results on our benchmark demonstrate the superiority of Diffusion-4K in ultra-high-resolution image synthesis.
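The abstract names two fine-detail metrics without defining them. As a rough illustration only, the sketch below computes a gray-level co-occurrence matrix (GLCM) with the classical Haralick contrast statistic, and a compression ratio defined as raw size divided by compressed size. Several choices here are assumptions rather than the paper's definitions: the contrast statistic as the texture score, the single horizontal offset, the 8 gray levels, and zlib as a lossless stand-in for whatever codec the authors use. The helper names `glcm`, `glcm_contrast`, and `compression_ratio` are hypothetical.

```python
import random
import zlib


def glcm(image, levels, dx=1, dy=0):
    """Normalized gray-level co-occurrence matrix for one pixel offset."""
    counts = [[0] * levels for _ in range(levels)]
    height, width = len(image), len(image[0])
    total = 0
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                # Count how often level a is followed by level b at the offset.
                counts[image[y][x]][image[ny][nx]] += 1
                total += 1
    return [[c / total for c in row] for row in counts]


def glcm_contrast(p):
    """Haralick contrast: sum of p[i][j] * (i - j)^2; zero for a flat image."""
    n = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))


def compression_ratio(image):
    """Raw byte size divided by zlib-compressed size (assumed stand-in codec)."""
    raw = bytes(v for row in image for v in row)
    return len(raw) / len(zlib.compress(raw, 9))


# Toy 8-level grayscale images: one flat, one with random texture.
random.seed(0)
flat = [[3] * 64 for _ in range(64)]
noisy = [[random.randrange(8) for _ in range(64)] for _ in range(64)]

flat_contrast = glcm_contrast(glcm(flat, 8))
noisy_contrast = glcm_contrast(glcm(noisy, 8))
flat_ratio = compression_ratio(flat)
noisy_ratio = compression_ratio(noisy)
```

On these toy inputs the flat image has zero contrast and compresses far better than the textured one, which is the intuition behind using both quantities as proxies for fine detail: richer local texture raises co-occurrence contrast and lowers compressibility.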