ZipIR: Latent Pyramid Diffusion Transformer for High-Resolution Image Restoration

Abstract

Recent progress in generative models has significantly improved image restoration, particularly through powerful diffusion models that recover semantic details and local fidelity remarkably well. However, deploying these models at ultra-high resolutions faces a critical trade-off between quality and efficiency because of the computational cost of long-range attention. To address this, we introduce ZipIR, a novel framework that improves efficiency, scalability, and long-range modeling for high-resolution image restoration. ZipIR employs a highly compressed latent representation that compresses the image by 32x, effectively reducing the number of spatial tokens and enabling the use of high-capacity models such as the Diffusion Transformer (DiT). Toward this goal, we propose a Latent Pyramid VAE (LP-VAE) design that structures the latent space into sub-bands to ease diffusion training. Trained on full images up to 2K resolution, ZipIR surpasses existing diffusion-based methods, offering unmatched speed and quality in restoring high-resolution images from severely degraded inputs.
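To make the token-count argument concrete, here is a rough back-of-the-envelope sketch. It assumes that "32x" is a per-side spatial downsampling factor, that a 2K image is 2048x2048 pixels, that each latent position becomes one transformer token, and that an 8x latent is a fair baseline for comparison. None of these details are stated in the abstract, and the helper `latent_tokens` is hypothetical.

```python
def latent_tokens(height: int, width: int, downsample: int) -> int:
    """Count spatial tokens when each side is reduced by `downsample`.

    Assumes one token per latent position (no extra patchification).
    """
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsampling factor")
    return (height // downsample) * (width // downsample)


# Assumed 2K input size; the abstract only says "up to 2K resolution".
HEIGHT, WIDTH = 2048, 2048

tokens_f8 = latent_tokens(HEIGHT, WIDTH, 8)    # assumed baseline: 8x latent
tokens_f32 = latent_tokens(HEIGHT, WIDTH, 32)  # 32x latent, as in the abstract

# Fewer tokens by the square of the ratio of downsampling factors.
token_ratio = tokens_f8 // tokens_f32

# Full self-attention compares every token with every other token,
# so the pairwise cost scales with the square of the token count.
attention_ratio = token_ratio ** 2

print(tokens_f8, tokens_f32, token_ratio, attention_ratio)
```

Under these assumptions, moving from an 8x to a 32x latent cuts the token count by a factor of 16 and the pairwise attention cost by a factor of 256, which is the kind of saving that would make full-image attention with a large DiT practical at this resolution.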
