L2P: Unlocking Latent Potential for Pixel Generation
May 12, 2026
Authors: Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Jiawei Chen, Zhuoqi Zeng, Wei Zhang, Chengjie Wang, Jian Yang, Ying Tai
cs.AI
Abstract
Pixel diffusion models have recently regained attention for visual generation. However, training advanced pixel-space models from scratch demands prohibitive computational and data resources. To address this, we propose the Latent-to-Pixel (L2P) transfer paradigm, an efficient framework that directly harnesses the rich knowledge of pre-trained LDMs to build powerful pixel-space models. Specifically, L2P discards the VAE in favor of large-patch tokenization and freezes the source LDM's intermediate layers, exclusively training shallow layers to learn the latent-to-pixel transformation. By utilizing LDM-generated synthetic images as the sole training corpus, L2P fits an already smooth data manifold, enabling rapid convergence with zero real-data collection. This strategy allows L2P to seamlessly migrate massive latent priors to the pixel space using only 8 GPUs. Furthermore, eliminating the VAE memory bottleneck unlocks native 4K ultra-high resolution generation. Extensive experiments across mainstream LDM architectures show that L2P incurs negligible training overhead, yet performs on par with the source LDM on DPG-Bench and reaches 93% performance on GenEval.
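The two core moves of the recipe above, replacing the VAE with large-patch tokenization of raw pixels and training only the shallow layers around a frozen LDM backbone, can be illustrated with a minimal sketch. All names and numbers here (patch sizes, block counts, the `n_shallow` split) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the L2P transfer recipe described in the abstract.
# Names and numbers are illustrative assumptions, not the paper's code.

def token_count(image_size: int, patch_size: int) -> int:
    """Number of tokens when an image is split into square patches.

    Large patches (e.g. 16 or 32 px) stand in for the VAE encoder: the
    pixel-space model tokenizes raw pixels directly, so no latent
    compression stage (and no VAE memory bottleneck) is needed.
    """
    assert image_size % patch_size == 0
    return (image_size // patch_size) ** 2

def split_trainable(layer_names, n_shallow: int = 2):
    """Freeze the source LDM's intermediate blocks; train only the
    shallow input/output blocks that learn the latent-to-pixel mapping."""
    trainable = set(layer_names[:n_shallow] + layer_names[-n_shallow:])
    frozen = [n for n in layer_names if n not in trainable]
    return sorted(trainable), frozen

# A 1024x1024 image with 32-px pixel patches yields the same 32x32 token
# grid as an 8x-downsampled VAE latent patchified at 4 latent pixels,
# so the frozen backbone sees a familiar sequence length.
print(token_count(1024, 32))   # 1024 tokens

layers = [f"block_{i}" for i in range(28)]
train, frozen = split_trainable(layers)
print(len(train), len(frozen))  # 4 trainable blocks, 24 frozen
```

Because sequence length is preserved while the VAE is dropped, the frozen intermediate layers can keep operating on token statistics close to what they saw during latent-space pre-training, which is plausibly why only the shallow layers need retraining.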