One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models

November 13, 2025
Authors: Aleksandr Razin, Danil Kazantsev, Ilya Makarov
cs.AI

Abstract

Diffusion models struggle to scale beyond their training resolutions, as direct high-resolution sampling is slow and costly, while post-hoc image super-resolution (ISR) introduces artifacts and additional latency by operating after decoding. We present the Latent Upscaler Adapter (LUA), a lightweight module that performs super-resolution directly on the generator's latent code before the final VAE decoding step. LUA integrates as a drop-in component, requiring no modifications to the base model or additional diffusion stages, and enables high-resolution synthesis through a single feed-forward pass in latent space. A shared Swin-style backbone with scale-specific pixel-shuffle heads supports 2x and 4x factors and remains compatible with image-space SR baselines, achieving comparable perceptual quality with nearly 3x lower decoding and upscaling time (adding only +0.42 s for 1024 px generation from 512 px, compared to 1.87 s for pixel-space SR using the same SwinIR architecture). Furthermore, LUA shows strong generalization across the latent spaces of different VAEs, making it easy to deploy without retraining from scratch for each new decoder. Extensive experiments demonstrate that LUA closely matches the fidelity of native high-resolution generation while offering a practical and efficient path to scalable, high-fidelity image synthesis in modern diffusion pipelines.
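
To make the "scale-specific pixel-shuffle heads" concrete, here is a minimal sketch of the pixel-shuffle rearrangement and the resulting latent shapes. It is not the authors' implementation: the function names are hypothetical, the code uses plain nested lists instead of a tensor library, and the example figures (4 latent channels, 8x VAE downsampling) are assumptions typical of Stable Diffusion-style VAEs rather than values stated in the abstract.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Each group of r*r input channels is spread over an r-by-r block of
    output pixels, trading channel depth for spatial resolution.
    """
    channels_in = len(x)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = channels_in // (r * r)
    h, w = len(x[0]), len(x[0][0])
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):          # row offset inside the r-by-r block
            for j in range(r):      # column offset inside the r-by-r block
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out


def upscaled_latent_shape(channels, height, width, scale):
    """Latent shape after a scale-specific head (hypothetical helper)."""
    if scale not in (2, 4):
        raise ValueError("the abstract only mentions 2x and 4x heads")
    return (channels, height * scale, width * scale)
```

Under the assumed 8x VAE downsampling, a 512 px image corresponds to a 64x64 latent, so `upscaled_latent_shape(4, 64, 64, 2)` gives a 128x128 latent that decodes to 1024 px. In a real pipeline the rearrangement would normally come from a tensor library (for example `torch.nn.PixelShuffle`) rather than nested loops.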