L2P: Unlocking Latent Potential for Pixel Generation
May 12, 2026
Authors: Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Jiawei Chen, Zhuoqi Zeng, Wei Zhang, Chengjie Wang, Jian Yang, Ying Tai
cs.AI
Abstract
Pixel diffusion models have recently regained attention for visual generation. However, training advanced pixel-space models from scratch demands prohibitive computational and data resources. To address this, we propose the Latent-to-Pixel (L2P) transfer paradigm, an efficient framework that directly harnesses the rich knowledge of pre-trained LDMs to build powerful pixel-space models. Specifically, L2P discards the VAE in favor of large-patch tokenization and freezes the source LDM's intermediate layers, exclusively training shallow layers to learn the latent-to-pixel transformation. By utilizing LDM-generated synthetic images as the sole training corpus, L2P fits an already smooth data manifold, enabling rapid convergence with zero real-data collection. This strategy allows L2P to seamlessly migrate massive latent priors to the pixel space using only 8 GPUs. Furthermore, eliminating the VAE memory bottleneck unlocks native 4K ultra-high resolution generation. Extensive experiments across mainstream LDM architectures show that L2P incurs negligible training overhead, yet performs on par with the source LDM on DPG-Bench and reaches 93% performance on GenEval.
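The two core moves of the recipe above, replacing the VAE with large-patch tokenization of raw pixels and training only the shallow layers around a frozen LDM backbone, can be illustrated with a minimal sketch. All names and numbers here (patch sizes, block counts, the `n_shallow` split) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the L2P transfer recipe described in the abstract.
# Names and numbers are illustrative assumptions, not the paper's code.

def token_count(image_size: int, patch_size: int) -> int:
    """Number of tokens when an image is split into square patches.

    Large patches (e.g. 16 or 32 px) stand in for the VAE encoder: the
    pixel-space model tokenizes raw pixels directly, so no latent
    compression stage (and no VAE memory bottleneck) is needed.
    """
    assert image_size % patch_size == 0
    return (image_size // patch_size) ** 2

def split_trainable(layer_names, n_shallow: int = 2):
    """Freeze the source LDM's intermediate blocks; train only the
    shallow input/output blocks that learn the latent-to-pixel mapping."""
    trainable = set(layer_names[:n_shallow] + layer_names[-n_shallow:])
    frozen = [n for n in layer_names if n not in trainable]
    return sorted(trainable), frozen

# A 1024x1024 image with 32-px pixel patches yields the same 32x32 token
# grid as an 8x-downsampled VAE latent patchified at 4 latent pixels,
# so the frozen backbone sees a familiar sequence length.
print(token_count(1024, 32))   # 1024 tokens

layers = [f"block_{i}" for i in range(28)]
train, frozen = split_trainable(layers)
print(len(train), len(frozen))  # 4 trainable blocks, 24 frozen
```

Because sequence length is preserved while the VAE is dropped, the frozen intermediate layers can keep operating on token statistics close to what they saw during latent-space pre-training, which is plausibly why only the shallow layers need retraining.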