PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss
February 2, 2026
Authors: Zehong Ma, Ruihan Xu, Shiliang Zhang
cs.AI
Abstract
Pixel diffusion generates images directly in pixel space in an end-to-end manner, avoiding the artifacts and bottlenecks introduced by VAEs in two-stage latent diffusion. However, optimizing high-dimensional pixel manifolds that contain many perceptually irrelevant signals is challenging, leaving existing pixel diffusion methods lagging behind latent diffusion models. We propose PixelGen, a simple pixel diffusion framework with perceptual supervision. Instead of modeling the full image manifold, PixelGen introduces two complementary perceptual losses that guide the diffusion model toward learning a more meaningful perceptual manifold: an LPIPS loss facilitates learning better local patterns, while a DINO-based perceptual loss strengthens global semantics. With perceptual supervision, PixelGen surpasses strong latent diffusion baselines. It achieves an FID of 5.11 on ImageNet-256 without classifier-free guidance using only 80 training epochs, and demonstrates favorable scaling performance on large-scale text-to-image generation with a GenEval score of 0.79. PixelGen requires no VAEs, no latent representations, and no auxiliary stages, providing a simpler yet more powerful generative paradigm. Code is publicly available at https://github.com/Zehong-Ma/PixelGen.
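To make the objective concrete, below is a minimal PyTorch sketch of how the two perceptual losses described in the abstract could be combined with a standard pixel-space regression loss in one training step. The linear corruption schedule, loss weights, x0-prediction model interface, and the function name `pixelgen_step` are all illustrative assumptions, not the released PixelGen implementation; see the linked repository for the authors' code.

```python
import torch
import torch.nn.functional as F
import lpips  # pip install lpips (Zhang et al.'s perceptual-similarity package)

# Frozen perceptual encoders. DINO ViT-S/16 is loaded from the official hub;
# ImageNet input normalization is omitted here for brevity.
lpips_fn = lpips.LPIPS(net="vgg").eval()
dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16").eval()
for p in dino.parameters():
    p.requires_grad_(False)

def pixelgen_step(model, x0, t, w_lpips=1.0, w_dino=1.0):
    """One training step (illustrative). x0: clean images in [-1, 1],
    shape (B, 3, H, W); t: timesteps in [0, 1], shape (B,).
    `model` is assumed to predict the clean image x0."""
    noise = torch.randn_like(x0)
    t_ = t.view(-1, 1, 1, 1)
    xt = (1.0 - t_) * x0 + t_ * noise   # hypothetical linear corruption
    pred = model(xt, t)                 # predicted clean image

    # Standard pixel-space regression over the full image manifold.
    loss_pix = F.mse_loss(pred, x0)

    # LPIPS: local-pattern supervision (LPIPS expects inputs in [-1, 1]).
    loss_lpips = lpips_fn(pred, x0).mean()

    # DINO feature distance: global-semantics supervision.
    with torch.no_grad():
        gt_feat = dino(F.interpolate(x0, size=224, mode="bilinear"))
    pred_feat = dino(F.interpolate(pred, size=224, mode="bilinear"))
    loss_dino = F.mse_loss(pred_feat, gt_feat)

    return loss_pix + w_lpips * loss_lpips + w_dino * loss_dino
```

Note that both perceptual encoders are frozen: gradients flow through `pred` into the diffusion model only, so the LPIPS and DINO terms act purely as supervision signals rather than trainable components.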