PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion

Abstract

Most practical high-resolution text-to-image systems, including recent latent diffusion and autoregressive models, generate in a compact latent space and rely on a decoder to map the generated latents back to pixels. Yet the latent-to-pixel decoder is reconstruction-oriented: it is optimized to invert the encoder rather than to synthesize additional detail, and it becomes increasingly costly at megapixel scale. This drawback calls for a more expressive and efficient decoding paradigm. Motivated by recent progress in scalable pixel-space diffusion, we introduce PiD, a Pixel diffusion Decoder that reformulates latent decoding as conditional pixel diffusion, unifying decoding and upsampling in one generative module. By denoising directly in high-resolution pixel space, PiD synthesizes 4× and even 8× upscaled images with low latency. For latent conditioning, a lightweight sigma-aware adapter injects noise-corrupted latents into the pixel diffusion backbone, which lets PiD decode partially denoised latents and terminate the latent diffusion process early. To further improve efficiency, we distill the model using DMD2, reducing inference to just 4 steps. PiD applies both to conventional VAE latents and to the semantic latents (e.g., SigLIP, DINOv2) used in recent RAE-based models. PiD decodes the latents of 512×512 images into 2048×2048 pixels in under 1 second with 13 GB peak memory on a consumer RTX 5090, and in as little as 210 ms on a GB200 GPU, about 6× faster than cascaded diffusion-based super-resolution pipelines while achieving better visual fidelity.
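
To make the idea of decoding as a few-step conditional denoising loop concrete, the following is a minimal toy sketch in NumPy. It is not the PiD implementation: the function names (`upsample`, `toy_denoiser`, `decode`), the linear sigma schedule, the Euler update, and the shrinkage weight standing in for sigma-aware conditioning are all assumptions made for illustration, and the "denoiser" is a fixed formula rather than a trained network.

```python
import numpy as np


def upsample(latent, scale):
    """Nearest-neighbour upsampling; a hypothetical stand-in for a learned adapter."""
    return np.repeat(np.repeat(latent, scale, axis=0), scale, axis=1)


def toy_denoiser(x, cond, sigma, latent_sigma, scale):
    """Hypothetical stand-in for a conditional pixel-diffusion backbone.

    Returns an estimate of the clean image. A real network would use x and
    sigma; this toy ignores them and only shrinks the conditioning signal
    according to how noisy the latent is (latent_sigma).
    """
    weight = 1.0 / (1.0 + latent_sigma ** 2)
    return weight * upsample(cond, scale)


def decode(cond, latent_sigma=0.0, scale=4, num_steps=4, sigma_max=1.0, seed=0):
    """Decode and upsample a 2-D latent with a few Euler denoising steps."""
    rng = np.random.default_rng(seed)
    h, w = cond.shape
    # Assumed schedule: linearly spaced noise levels ending at exactly zero.
    sigmas = np.linspace(sigma_max, 0.0, num_steps + 1)
    # Start from noise at the target (upscaled) pixel resolution.
    x = sigmas[0] * rng.standard_normal((h * scale, w * scale))
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x0_hat = toy_denoiser(x, cond, sigma, latent_sigma, scale)
        direction = (x - x0_hat) / sigma  # sigma is never zero inside the loop
        x = x + (sigma_next - sigma) * direction
    return x


latent = np.arange(16, dtype=float).reshape(4, 4)
image = decode(latent)  # 4 steps, 4x upscaling
print(image.shape)
```

Because the last step moves to a noise level of zero, the output equals the final clean-image estimate, so the quality of such a loop rests entirely on the denoiser. In this toy the denoiser is trivial, which is why the sketch only illustrates the control flow (noise at target resolution, a handful of conditioned steps) and not the model itself.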