One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models

Abstract

Diffusion models struggle to scale beyond their training resolutions: direct high-resolution sampling is slow and costly, while post-hoc image super-resolution (ISR) operates after decoding and therefore introduces artifacts and additional latency. We present the Latent Upscaler Adapter (LUA), a lightweight module that performs super-resolution directly on the generator's latent code before the final VAE decoding step. LUA integrates as a drop-in component, requires no modifications to the base model and no additional diffusion stages, and enables high-resolution synthesis through a single feed-forward pass in latent space. A shared Swin-style backbone with scale-specific pixel-shuffle heads supports 2x and 4x factors and remains compatible with image-space SR baselines, achieving comparable perceptual quality with nearly 3x lower decoding and upscaling time (adding only 0.42 s for 1024 px generation from 512 px, compared to 1.87 s for pixel-space SR using the same SwinIR architecture). Furthermore, LUA generalizes well across the latent spaces of different VAEs, so it can be deployed without retraining from scratch for each new decoder. Extensive experiments demonstrate that LUA closely matches the fidelity of native high-resolution generation while offering a practical and efficient path to scalable, high-fidelity image synthesis in modern diffusion pipelines.
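The "shared backbone, scale-specific pixel-shuffle heads" design can be illustrated with a small shape-level sketch. This is not the authors' implementation: the Swin-style backbone is replaced by a pass-through placeholder, the head weights are random rather than trained, and the names `pixel_shuffle`, `make_head`, `LATENT_CHANNELS`, and `VAE_STRIDE` are hypothetical. The 4-channel latent and the 8x VAE stride are assumptions (common for Stable-Diffusion-style VAEs) that the abstract does not state.

```python
import numpy as np

LATENT_CHANNELS = 4  # assumption: SD-style VAE latent, not stated in the abstract
VAE_STRIDE = 8       # assumption: decoder upsamples the latent 8x per side


def pixel_shuffle(x, r):
    """Rearrange (C*r*r, H, W) into (C, H*r, W*r), as in sub-pixel convolution."""
    c_r2, h, w = x.shape
    if c_r2 % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c = c_r2 // (r * r)
    # out[c, h*r + i, w*r + j] = x[c*r*r + i*r + j, h, w]
    return x.reshape(c, r, r, h, w).transpose(0, 3, 1, 4, 2).reshape(c, h * r, w * r)


def make_head(feat_dim, scale, rng):
    """Hypothetical scale-specific head: 1x1 projection followed by pixel shuffle."""
    weight = rng.standard_normal((LATENT_CHANNELS * scale * scale, feat_dim)) * 0.01

    def head(features):
        expanded = np.einsum("of,fhw->ohw", weight, features)  # per-pixel linear map
        return pixel_shuffle(expanded, scale)

    return head


rng = np.random.default_rng(0)
latent = rng.standard_normal((LATENT_CHANNELS, 64, 64))  # 512 px image / stride 8
features = latent  # placeholder for the shared Swin-style backbone output

heads = {2: make_head(LATENT_CHANNELS, 2, rng), 4: make_head(LATENT_CHANNELS, 4, rng)}
up2 = heads[2](features)
up4 = heads[4](features)

# Side length in pixels that a stride-8 decoder would produce from each latent.
print(up2.shape, up2.shape[-1] * VAE_STRIDE)
print(up4.shape, up4.shape[-1] * VAE_STRIDE)
```

Under these assumptions, the 2x head turns a 64x64 latent into a 128x128 latent, which a stride-8 decoder would map to 1024 px. Both heads read the same backbone features, so only the final projection differs per scale.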
