PiD: Fast High-Resolution Latent Decoding with Pixel Diffusion

Abstract

Most practical high-resolution text-to-image systems, including latent diffusion and autoregressive models, perform generation in a compact latent space, and a decoder maps the generated latents back to pixels. Yet the latent-to-pixel decoder is reconstruction-oriented: it is optimized to invert the encoder rather than to synthesize additional detail, and it becomes increasingly costly at megapixel scale. This drawback calls for a more expressive and efficient decoding paradigm. Motivated by recent progress in scalable pixel-space diffusion, we introduce PiD, a Pixel diffusion Decoder that reformulates latent decoding as conditional pixel diffusion, unifying decoding and upsampling into one generative module. By denoising directly in high-resolution pixel space, PiD synthesizes 4× and even 8× upscaled images with low latency. For latent conditioning, a lightweight sigma-aware adapter injects noise-corrupted latents into the pixel diffusion backbone, enabling PiD to decode partially denoised latents and terminate the latent diffusion process early. To further improve efficiency, we distill the model using DMD2, reducing inference to just 4 steps. PiD applies to both conventional VAE latents and the semantic latents (e.g., SigLIP, DINOv2) used in recent RAE-based models. PiD decodes latents of 512×512 images into 2048×2048 pixels in under 1 second with 13 GB peak memory on a consumer RTX 5090, and in as little as 210 ms on a GB200 GPU, about 6× faster than cascaded diffusion-based super-resolution pipelines, with better visual fidelity.
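To make the idea of decoding as few-step conditional denoising concrete, the following is a minimal toy sketch, not the PiD implementation. It assumes a generic Euler-style sampler over a made-up sigma schedule, treats the "latent" as a tiny grayscale grid, and replaces the pixel diffusion backbone with a hypothetical stand-in `toy_denoiser` that simply returns the upsampled conditioning. The names `upsample_nearest`, `toy_denoiser`, and `decode`, as well as the schedule values, are illustrative assumptions that do not come from the paper.

```python
import random


def upsample_nearest(grid, factor):
    """Repeat every value `factor` times along both axes."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def toy_denoiser(x, sigma, cond):
    """Hypothetical stand-in for a trained backbone.

    A real model would predict a detailed clean image from the noisy
    input `x`, the noise level `sigma`, and the latent conditioning.
    This toy ignores `x` and `sigma` and returns the conditioning.
    """
    return cond


def decode(latent, factor=4, sigmas=(1.0, 0.6, 0.3, 0.1, 0.0), seed=0):
    """Run len(sigmas) - 1 Euler-style denoising steps at output resolution.

    The schedule is an assumption chosen only to give 4 steps.
    """
    rng = random.Random(seed)
    cond = upsample_nearest(latent, factor)
    # Start from pure noise at the highest noise level.
    x = [[sigmas[0] * rng.gauss(0.0, 1.0) for _ in row] for row in cond]
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        x0 = toy_denoiser(x, s_cur, cond)
        # Euler step x + (s_next - s_cur) * (x - x0) / s_cur, rewritten
        # as an interpolation toward the clean estimate x0.
        ratio = s_next / s_cur
        x = [[p0 + ratio * (p - p0) for p, p0 in zip(r, r0)]
             for r, r0 in zip(x, x0)]
    return x


latent = [[0.0, 1.0], [2.0, 3.0]]
image = decode(latent, factor=4)
print(len(image), len(image[0]))
```

Because the stand-in denoiser only echoes its conditioning, the loop collapses to the nearest-neighbour upsampled input. With a trained conditional denoiser in its place, the same loop structure is where new high-resolution detail would be synthesized, which is the distinction the abstract draws between generative decoding and reconstruction-oriented decoding.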