PiD: Fast High-Resolution Latent Decoding Based on Pixel Diffusion

Abstract

Most practical high-resolution text-to-image systems, including latent diffusion and autoregressive models, perform generation in a compact latent space, and a decoder maps the generated latents back to pixels. Yet the latent-to-pixel decoder is reconstruction-oriented, optimized to invert the encoder rather than synthesize more details, and becomes increasingly costly at megapixel scale. This drawback calls for a more expressive and efficient decoding paradigm. Motivated by recent progress in scalable pixel-space diffusion, we introduce PiD, a Pixel diffusion Decoder that reformulates latent decoding as conditional pixel diffusion, unifying decoding and upsampling into one generative module. By denoising directly in high-resolution pixel space, PiD synthesizes 4× and even 8× upscaled images with low latency. For latent conditioning, a lightweight sigma-aware adapter injects noise-corrupted latents into the pixel diffusion backbone, enabling PiD to decode partially denoised latents and terminate the latent diffusion process early. To further improve efficiency, we distill the model using DMD2, reducing inference to just 4 steps. PiD applies to both conventional VAE latents and semantic latents (e.g., SigLIP, DINOv2) used in recent RAE-based models. PiD decodes latents of 512×512 images into 2048×2048 pixels in under 1 second with 13 GB peak memory on a consumer RTX 5090, and as fast as 210 ms on a GB200 GPU, about 6× faster than cascaded diffusion-based super-resolution pipelines with better visual fidelity.
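
To make the latent-conditioning idea concrete, the following is a minimal NumPy sketch of how a sigma-aware adapter could turn a noise-corrupted latent into a pixel-resolution conditioning signal. It is not the PiD implementation: the abstract does not specify the adapter's architecture, so the additive Gaussian noise model, the sinusoidal embedding of log sigma, the scale-and-shift modulation, the nearest-neighbour upsampling, and all names (`corrupt_latent`, `sigma_embedding`, `SigmaAwareAdapter`) are assumptions chosen for illustration. The toy shapes are likewise hypothetical.

```python
import numpy as np

def corrupt_latent(z, sigma, rng):
    """Add Gaussian noise with standard deviation sigma (assumed noise model)."""
    return z + sigma * rng.standard_normal(z.shape)

def sigma_embedding(sigma, dim=8):
    """Sinusoidal embedding of log(sigma); requires sigma > 0 (assumed design)."""
    freqs = 2.0 ** np.arange(dim // 2)
    angle = np.log(sigma) * freqs
    return np.concatenate([np.sin(angle), np.cos(angle)])

class SigmaAwareAdapter:
    """Hypothetical adapter: per-channel scale and shift driven by sigma,
    followed by nearest-neighbour upsampling to the pixel grid."""

    def __init__(self, latent_channels, embed_dim=8, seed=0):
        rng = np.random.default_rng(seed)
        self.embed_dim = embed_dim
        # Random weights stand in for learned parameters.
        self.w_scale = 0.1 * rng.standard_normal((embed_dim, latent_channels))
        self.w_shift = 0.1 * rng.standard_normal((embed_dim, latent_channels))

    def __call__(self, z_noisy, sigma, upscale):
        emb = sigma_embedding(sigma, self.embed_dim)
        scale = 1.0 + emb @ self.w_scale      # shape (C,)
        shift = emb @ self.w_shift            # shape (C,)
        h = z_noisy * scale[:, None, None] + shift[:, None, None]
        # Repeat each latent cell `upscale` times along height and width.
        return h.repeat(upscale, axis=1).repeat(upscale, axis=2)

# Toy usage: a 4-channel 8x8 latent mapped to a 256x256 conditioning map.
rng = np.random.default_rng(0)
z_clean = rng.standard_normal((4, 8, 8))
sigma = 0.5                                   # a partially denoised latent
adapter = SigmaAwareAdapter(latent_channels=4)
cond = adapter(corrupt_latent(z_clean, sigma, rng), sigma, upscale=32)
print(cond.shape)
```

Because the conditioning depends on sigma, a pixel diffusion backbone fed with `cond` could in principle learn how much to trust the latent at each noise level, which is the property that would let a decoder accept latents before the latent diffusion process has finished.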