PiD: Fast, High-Resolution Latent Decoding via Pixel Diffusion

Abstract

Most practical high-resolution text-to-image systems, including latent diffusion and autoregressive models, perform generation in a compact latent space, and a decoder maps the generated latents back to pixels. Yet the latent-to-pixel decoder is reconstruction-oriented, optimized to invert the encoder rather than to synthesize more detail, and it becomes increasingly costly at megapixel scale. This drawback calls for a more expressive and efficient decoding paradigm. Motivated by recent progress in scalable pixel-space diffusion, we introduce PiD, a Pixel diffusion Decoder that reformulates latent decoding as conditional pixel diffusion, unifying decoding and upsampling into one generative module. By denoising directly in high-resolution pixel space, PiD synthesizes 4× and even 8× upscaled images with low latency. For latent conditioning, a lightweight sigma-aware adapter injects noise-corrupted latents into the pixel diffusion backbone, enabling PiD to decode partially denoised latents and terminate the latent diffusion process early. To further improve efficiency, we distill the model using DMD2, reducing inference to just 4 steps. PiD applies to both conventional VAE latents and semantic latents (e.g., SigLIP, DINOv2) used in recent RAE-based models. PiD decodes latents of 512×512 images into 2048×2048 pixels in under 1 second with 13 GB peak memory on a consumer RTX 5090, and as fast as 210 ms on a GB200 GPU, about 6× faster than cascaded diffusion-based super-resolution pipelines with better visual fidelity.
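To make the conditioning idea concrete, the following is a minimal sketch of what "sigma-aware" conditioning on a noise-corrupted latent could look like. It is not the authors' implementation. The abstract does not state the noise schedule or the adapter architecture, so the linear noise interpolation, the sinusoidal sigma embedding, and the scale/shift modulation are all assumptions, and every function name (`output_size`, `corrupt_latent`, `sigma_embedding`, `adapter`) is hypothetical.

```python
import math
import random


def output_size(image_size: int, upscale: int) -> int:
    """Side length of the decoded image for a given upscaling factor."""
    return image_size * upscale


def corrupt_latent(latent, sigma, rng):
    """Blend a clean latent with Gaussian noise at level sigma in [0, 1].

    ASSUMPTION: linear interpolation between latent and noise; the
    abstract does not specify the actual schedule.
    """
    return [(1.0 - sigma) * z + sigma * rng.gauss(0.0, 1.0) for z in latent]


def sigma_embedding(sigma, dim=8):
    """Sinusoidal embedding of the noise level (dim must be even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(sigma * f) for f in freqs] + [math.cos(sigma * f) for f in freqs]


def adapter(noisy_latent, sigma, scale_w, shift_w):
    """Hypothetical sigma-aware adapter.

    ASSUMPTION: the sigma embedding is projected to one scale and one
    shift value that modulate the noisy latent before it is handed to
    the pixel diffusion backbone.
    """
    emb = sigma_embedding(sigma, dim=len(scale_w))
    scale = sum(w * e for w, e in zip(scale_w, emb))
    shift = sum(w * e for w, e in zip(shift_w, emb))
    return [(1.0 + scale) * z + shift for z in noisy_latent]
```

Under these assumptions, early termination amounts to stopping the latent sampler while some noise level `sigma > 0` remains and passing both the partially denoised latent and that `sigma` to the adapter, so the decoder knows how much noise to expect. The size arithmetic matches the abstract: `output_size(512, 4)` gives 2048, which is 16 times as many pixels as the 512×512 source.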