PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss

Abstract

Pixel diffusion generates images directly in pixel space in an end-to-end manner, avoiding the artifacts and bottlenecks that VAEs introduce in two-stage latent diffusion. However, high-dimensional pixel manifolds contain many perceptually irrelevant signals and are hard to optimize, so existing pixel diffusion methods have lagged behind latent diffusion models. We propose PixelGen, a simple pixel diffusion framework with perceptual supervision. Instead of modeling the full image manifold, PixelGen introduces two complementary perceptual losses that guide the diffusion model toward learning a more meaningful perceptual manifold. An LPIPS loss facilitates learning better local patterns, while a DINO-based perceptual loss strengthens global semantics. With perceptual supervision, PixelGen surpasses strong latent diffusion baselines. It achieves an FID of 5.11 on ImageNet-256 without classifier-free guidance using only 80 training epochs, and demonstrates favorable scaling performance on large-scale text-to-image generation with a GenEval score of 0.79. PixelGen requires no VAEs, no latent representations, and no auxiliary stages, providing a simpler yet more powerful generative paradigm. Code is publicly available at https://github.com/Zehong-Ma/PixelGen.
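
As a rough illustration of how the two perceptual terms might enter training, the sketch below assumes they are added to a standard diffusion objective as a weighted sum. The abstract does not give the actual formulation. The weights \lambda_{\text{LPIPS}} and \lambda_{\text{DINO}}, the predicted clean image \hat{x}_0, the feature extractor f_{\text{DINO}}, and the feature distance d are all hypothetical names introduced here for illustration.

```latex
% Hypothetical combined objective; every symbol here is an assumption,
% not taken from the abstract.
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{diff}}
  + \lambda_{\text{LPIPS}} \, \mathcal{L}_{\text{LPIPS}}(\hat{x}_0, x_0)
  + \lambda_{\text{DINO}} \, d\big(f_{\text{DINO}}(\hat{x}_0),\, f_{\text{DINO}}(x_0)\big)
```

Under this reading, the LPIPS term compares the predicted and ground-truth images on local patterns, while the DINO term compares them in a semantic feature space. How the terms are actually weighted, and at which noise levels they are applied, would need to be checked against the paper or the released code.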
