L2P: Unlocking Latent Potential for Pixel Generation

Abstract

Pixel diffusion models have recently regained attention for visual generation. However, training advanced pixel-space models from scratch demands prohibitive computational and data resources. To address this, we propose the Latent-to-Pixel (L2P) transfer paradigm, an efficient framework that directly harnesses the rich knowledge of pre-trained latent diffusion models (LDMs) to build powerful pixel-space models. Specifically, L2P discards the VAE in favor of large-patch tokenization and freezes the source LDM's intermediate layers, training only the shallow layers to learn the latent-to-pixel transformation. Because its sole training corpus consists of LDM-generated synthetic images, L2P fits an already smooth data manifold, which enables rapid convergence without collecting any real data. This strategy allows L2P to migrate massive latent priors to the pixel space using only 8 GPUs. Furthermore, eliminating the VAE memory bottleneck unlocks native 4K ultra-high-resolution generation. Extensive experiments across mainstream LDM architectures show that L2P incurs negligible training overhead, yet performs on par with the source LDM on DPG-Bench and reaches 93% performance on GenEval.
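The two structural choices in the abstract, large-patch tokenization and training only the shallow layers, can be illustrated with a small sketch. It is not the authors' implementation. The patch sizes, the layer counts, and the reading of "shallow" as the first and last few blocks are all assumptions made for illustration; the abstract specifies none of them.

```python
def token_count(height, width, patch_size):
    """Number of tokens when an image is cut into non-overlapping square patches."""
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)


def trainable_mask(num_layers, num_shallow):
    """Return one flag per layer: True if trained, False if frozen.

    Assumption (not stated in the abstract): the "shallow" layers are the
    first and last `num_shallow` blocks, and everything in between is the
    frozen intermediate stack inherited from the source LDM.
    """
    if num_shallow < 0 or 2 * num_shallow > num_layers:
        raise ValueError("num_shallow must satisfy 0 <= 2 * num_shallow <= num_layers")
    return [
        i < num_shallow or i >= num_layers - num_shallow
        for i in range(num_layers)
    ]


if __name__ == "__main__":
    # Hypothetical patch sizes: doubling the patch side quarters the token count.
    for p in (16, 32):
        print(f"4096x4096 image, patch {p}: {token_count(4096, 4096, p)} tokens")

    # Hypothetical 24-block backbone with 2 trainable blocks at each end.
    mask = trainable_mask(24, 2)
    print(f"trainable blocks: {sum(mask)} of {len(mask)}")
```

Doubling the patch side quarters the sequence length, which is why larger patches keep pixel-space sequences manageable at high resolution. In a real framework, the mask would be applied by disabling gradients on the frozen blocks.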

