L2P: Unlocking Latent Potential for Pixel Generation
May 12, 2026
Authors: Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Jiawei Chen, Zhuoqi Zeng, Wei Zhang, Chengjie Wang, Jian Yang, Ying Tai
cs.AI
Abstract
Pixel diffusion models have recently regained attention for visual generation. However, training advanced pixel-space models from scratch demands prohibitive computational and data resources. To address this, we propose the Latent-to-Pixel (L2P) transfer paradigm, an efficient framework that directly harnesses the rich knowledge of pre-trained LDMs to build powerful pixel-space models. Specifically, L2P discards the VAE in favor of large-patch tokenization and freezes the source LDM's intermediate layers, training only the shallow layers to learn the latent-to-pixel transformation. By using LDM-generated synthetic images as the sole training corpus, L2P fits an already smooth data manifold, enabling rapid convergence with zero real-data collection. This strategy allows L2P to seamlessly migrate massive latent priors to pixel space using only 8 GPUs. Furthermore, eliminating the VAE memory bottleneck unlocks native 4K ultra-high-resolution generation. Extensive experiments across mainstream LDM architectures show that L2P incurs negligible training overhead, yet performs on par with the source LDM on DPG-Bench and reaches 93% of its performance on GenEval.
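To make the "large-patch tokenization" idea concrete, the sketch below shows how an image can be split directly into flattened pixel patches with no VAE encoder in between. This is a minimal illustration, not the paper's released code; the patch size of 16 and the plain reshape-based patchify are assumptions chosen for clarity.

```python
import numpy as np

def patchify(img: np.ndarray, patch: int = 16) -> np.ndarray:
    """Large-patch tokenization: split an H x W x C image directly into
    flattened pixel patches, replacing the VAE encoder entirely.

    Returns an array of shape (num_patches, patch * patch * C),
    i.e. one token per patch of raw pixels.
    """
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0, "image must tile evenly"
    tokens = (
        img.reshape(H // patch, patch, W // patch, patch, C)
           .transpose(0, 2, 1, 3, 4)          # group the two grid axes
           .reshape(-1, patch * patch * C)    # flatten each patch to a token
    )
    return tokens

# A 256x256 RGB image becomes a 16x16 grid of 768-dim pixel tokens.
img = np.zeros((256, 256, 3), dtype=np.float32)
tok = patchify(img)
print(tok.shape)  # (256, 768)
```

Because the token count grows only with (resolution / patch size)^2 and no VAE activations are held in memory, the same tokenizer scales to 4K inputs, which is what makes native ultra-high-resolution generation feasible in this setup.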