Latent Space Super-Resolution for Higher-Resolution Image Generation with Diffusion Models

Abstract

In this paper, we propose LSRNA, a novel framework for higher-resolution (exceeding 1K) image generation with diffusion models that applies super-resolution directly in the latent space. Existing diffusion models struggle to scale beyond their training resolutions, often producing structural distortions or repeated content. Reference-based methods address these issues by upsampling a low-resolution reference to guide higher-resolution generation, but they face significant challenges: upsampling in latent space often causes manifold deviation, which degrades output quality, whereas upsampling in RGB space tends to produce overly smoothed outputs. To overcome these limitations, LSRNA combines Latent space Super-Resolution (LSR) for manifold alignment with Region-wise Noise Addition (RNA) to enhance high-frequency details. Our extensive experiments demonstrate that integrating LSRNA outperforms state-of-the-art reference-based methods across various resolutions and metrics, and show the critical role of latent space upsampling in preserving detail and sharpness. The code is available at https://github.com/3587jjh/LSRNA.
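To make the two components concrete, the following is a minimal NumPy sketch of the general idea: upsample a latent, then add more noise where the latent has more local detail. It is not the authors' implementation. The abstract does not say how the LSR module is built or how RNA chooses regions and noise levels, so everything here is an assumption made for illustration: nearest-neighbour upsampling stands in for the learned LSR module, a gradient-magnitude map stands in for the region weighting, and the names `upsample_nearest`, `detail_map`, `region_wise_noise`, `sigma_min`, and `sigma_max` are hypothetical.

```python
import numpy as np


def upsample_nearest(latent, scale):
    """Stand-in for LSR (assumption): nearest-neighbour upsampling of a (C, H, W) latent.

    The real LSR module is a learned network; this placeholder only fixes shapes.
    """
    return latent.repeat(scale, axis=1).repeat(scale, axis=2)


def detail_map(latent):
    """Hypothetical region weighting: channel-averaged gradient magnitude scaled to [0, 1]."""
    grad_y, grad_x = np.gradient(latent, axis=(1, 2))
    magnitude = np.sqrt(grad_x**2 + grad_y**2).mean(axis=0)  # shape (H, W)
    span = magnitude.max() - magnitude.min()
    if span == 0:
        # A perfectly flat latent has no detail anywhere.
        return np.zeros_like(magnitude)
    return (magnitude - magnitude.min()) / span


def region_wise_noise(latent, weights, sigma_min, sigma_max, rng):
    """Add Gaussian noise whose standard deviation varies per spatial location.

    Locations with weight 0 receive sigma_min, locations with weight 1 receive sigma_max.
    """
    sigma = sigma_min + (sigma_max - sigma_min) * weights  # shape (H, W)
    noise = rng.standard_normal(latent.shape)
    return latent + sigma[None, :, :] * noise
```

In this sketch, a caller would upsample a reference latent with `upsample_nearest`, compute `detail_map` on the result, and pass both to `region_wise_noise` before continuing with the diffusion model's denoising steps. With `sigma_min` set to 0, flat regions are left untouched and only detailed regions are perturbed.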
