One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models

Abstract

Diffusion models struggle to scale beyond their training resolutions: direct high-resolution sampling is slow and costly, while post-hoc image super-resolution (ISR) operates after decoding and therefore introduces artifacts and additional latency. We present the Latent Upscaler Adapter (LUA), a lightweight module that performs super-resolution directly on the generator's latent code before the final VAE decoding step. LUA integrates as a drop-in component, requiring no modifications to the base model and no additional diffusion stages, and enables high-resolution synthesis through a single feed-forward pass in latent space. A shared Swin-style backbone with scale-specific pixel-shuffle heads supports 2× and 4× factors and remains compatible with image-space SR baselines, achieving comparable perceptual quality with nearly 3× lower decoding and upscaling time (adding only +0.42 s for 1024 px generation from 512 px, compared to 1.87 s for pixel-space SR using the same SwinIR architecture). Furthermore, LUA shows strong generalization across the latent spaces of different VAEs, making it easy to deploy without retraining from scratch for each new decoder. Extensive experiments demonstrate that LUA closely matches the fidelity of native high-resolution generation while offering a practical and efficient path to scalable, high-fidelity image synthesis in modern diffusion pipelines.
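The abstract mentions scale-specific pixel-shuffle heads that upscale the latent before VAE decoding. As a rough illustration only, the sketch below shows the pixel-shuffle (depth-to-space) rearrangement in plain Python, together with the shape arithmetic for a 512 px to 1024 px generation. It is not the authors' implementation: the function names, the VAE downsampling factor of 8, and the 4 latent channels are assumptions chosen for the example, and the channel ordering follows the common convention used by `torch.nn.PixelShuffle`.

```python
# Illustrative sketch, not the LUA implementation.
# Assumptions (hypothetical, for the example only): the VAE downsamples
# by a factor of 8 and uses 4 latent channels.

def pixel_shuffle(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r].

    Convention: out[c][h*r + i][w*r + j] = x[c*r*r + i*r + j][h][w]
    """
    channels_in = len(x)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = channels_in // (r * r)
    h, w = len(x[0]), len(x[0][0])
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):          # row offset inside each r x r block
            for j in range(r):      # column offset inside each r x r block
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out


def latent_shape(pixels, vae_factor=8, channels=4):
    """Latent (C, H, W) for a square image, under the assumed VAE factor."""
    side = pixels // vae_factor
    return (channels, side, side)


def upscaled_latent_shape(shape, r):
    """Shape after an r-times latent upscale; channel count is unchanged."""
    c, h, w = shape
    return (c, h * r, w * r)


# A 2x head must emit C * 2 * 2 channels, which pixel shuffle folds into space.
toy = [[[1]], [[2]], [[3]], [[4]]]      # 4 channels, 1 x 1 spatial
shuffled = pixel_shuffle(toy, 2)        # 1 channel, 2 x 2 spatial

base = latent_shape(512)                # latent for a 512 px image
up = upscaled_latent_shape(base, 2)     # latent after 2x upscaling
decoded_pixels = up[1] * 8              # side length after VAE decoding
```

Under these assumptions, a 512 px image corresponds to a 64 × 64 latent, the 2× head produces a 128 × 128 latent, and decoding that latent yields a 1024 px image. The point of the sketch is that the upscaling works on the small latent grid instead of the decoded pixel grid.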
