One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models

Abstract

Diffusion models struggle to scale beyond their training resolutions: direct high-resolution sampling is slow and costly, while post-hoc image super-resolution (ISR) operates after decoding and therefore introduces artifacts and additional latency. We present the Latent Upscaler Adapter (LUA), a lightweight module that performs super-resolution directly on the generator's latent code before the final VAE decoding step. LUA integrates as a drop-in component, requiring no modifications to the base model and no additional diffusion stages, and enables high-resolution synthesis through a single feed-forward pass in latent space. A shared Swin-style backbone with scale-specific pixel-shuffle heads supports 2x and 4x factors and remains compatible with image-space SR baselines. It achieves comparable perceptual quality with nearly 3x lower decoding and upscaling time: generating 1024 px output from 512 px adds only +0.42 s, compared to 1.87 s for pixel-space SR using the same SwinIR architecture. Furthermore, LUA generalizes strongly across the latent spaces of different VAEs, making it easy to deploy without retraining from scratch for each new decoder. Extensive experiments demonstrate that LUA closely matches the fidelity of native high-resolution generation while offering a practical and efficient path to scalable, high-fidelity image synthesis in modern diffusion pipelines.
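The shape bookkeeping behind latent-space upscaling with a pixel-shuffle head can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the Swin-style backbone is omitted, `toy_head` is a hypothetical stand-in that uses a random 1x1 channel expansion, and the VAE stride of 8 and the 4 latent channels are assumptions typical of Stable Diffusion-style VAEs rather than values stated in the abstract.

```python
import numpy as np

VAE_STRIDE = 8       # assumed: pixels per latent cell along each spatial axis
LATENT_CHANNELS = 4  # assumed: number of latent channels


def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Rearrange (C*r*r, H, W) into (C, H*r, W*r).

    Output pixel (c, h*r + i, w*r + j) is taken from input channel
    c*r*r + i*r + j at position (h, w).
    """
    c_in, h, w = x.shape
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c = c_in // (r * r)
    x = x.reshape(c, r, r, h, w)    # split channels into (c, i, j)
    x = x.transpose(0, 3, 1, 4, 2)  # reorder to (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)


def toy_head(z: np.ndarray, r: int, rng: np.random.Generator) -> np.ndarray:
    """Hypothetical scale-specific head: 1x1 channel expansion, then pixel shuffle.

    A trained head would use learned weights applied to backbone features;
    random weights are used here only to show the shapes.
    """
    c = z.shape[0]
    weight = rng.standard_normal((c * r * r, c)) / np.sqrt(c)
    expanded = np.einsum("oc,chw->ohw", weight, z)  # (C*r*r, H, W)
    return pixel_shuffle(expanded, r)               # (C, H*r, W*r)


rng = np.random.default_rng(0)
side = 512 // VAE_STRIDE                            # latent side for a 512 px image
z_low = rng.standard_normal((LATENT_CHANNELS, side, side))
z_high = toy_head(z_low, r=2, rng=rng)              # 2x upscaling in latent space

print("latent:", z_low.shape, "->", z_high.shape)
print("decoded side length (px):", z_high.shape[-1] * VAE_STRIDE)
```

Under these assumptions, a 64x64 latent becomes a 128x128 latent, which a stride-8 decoder would map to a 1024 px image. The decoder therefore runs only once, on the already upscaled latent.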
