L2P: Unlocking Latent Potential for Pixel Generation

Abstract

Pixel diffusion models have recently regained attention for visual generation. However, training advanced pixel-space models from scratch demands prohibitive computational and data resources. To address this, we propose the Latent-to-Pixel (L2P) transfer paradigm, an efficient framework that directly harnesses the rich knowledge of pre-trained latent diffusion models (LDMs) to build powerful pixel-space models. Specifically, L2P discards the VAE in favor of large-patch tokenization and freezes the source LDM's intermediate layers, training only the shallow layers to learn the latent-to-pixel transformation. By using LDM-generated synthetic images as the sole training corpus, L2P fits an already smooth data manifold, enabling rapid convergence with zero real-data collection. This strategy allows L2P to seamlessly migrate massive latent priors to the pixel space using only 8 GPUs. Furthermore, eliminating the VAE memory bottleneck unlocks native 4K ultra-high-resolution generation. Extensive experiments across mainstream LDM architectures show that L2P incurs negligible training overhead, yet performs on par with the source LDM on DPG-Bench and reaches 93% performance on GenEval.
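The two structural choices named in the abstract, large-patch tokenization and freezing of the intermediate layers, can be illustrated with a small sketch. It is not the authors' implementation. The 8x VAE downsampling factor, the latent patch size of 2, the pixel patch size of 16, and the block counts are illustrative assumptions that do not come from the abstract, and reading "shallow layers" as the blocks nearest the input and output is likewise an assumption. The helper names `token_count` and `trainable_mask` are hypothetical.

```python
def token_count(height, width, patch):
    """Number of tokens when a height x width grid is cut into patch x patch tiles."""
    if height % patch or width % patch:
        raise ValueError("height and width must be divisible by patch")
    return (height // patch) * (width // patch)


def trainable_mask(num_blocks, shallow):
    """Mark the first and last `shallow` blocks trainable (True) and freeze the rest.

    Assumption: "shallow layers" means the blocks closest to the input and
    output; the abstract does not spell this out.
    """
    if shallow < 0 or 2 * shallow > num_blocks:
        raise ValueError("shallow must satisfy 0 <= 2 * shallow <= num_blocks")
    return [i < shallow or i >= num_blocks - shallow for i in range(num_blocks)]


# Assumed latent setup: a 1024x1024 image, an 8x VAE, and 2x2 latent patches.
latent_tokens = token_count(1024 // 8, 1024 // 8, 2)

# Assumed pixel setup: no VAE, 16x16 pixel patches on the same image.
pixel_tokens = token_count(1024, 1024, 16)

# Under these assumptions both setups give the same sequence length, so the
# frozen intermediate blocks would see sequences of the size they were trained on.
print(latent_tokens, pixel_tokens)

# Hypothetical 6-block model with one trainable block at each end.
print(trainable_mask(6, 1))
```

Under these assumed settings, one 16x16 pixel patch covers the same image area as one 2x2 patch of an 8x-downsampled latent, which is one plausible reason a large patch size lets the frozen intermediate blocks be reused unchanged.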
