PixNerd: Pixel Neural Field Diffusion

Abstract

The current success of diffusion transformers depends heavily on the compressed latent space shaped by a pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address these problems, researchers have returned to pixel space, at the cost of complicated cascade pipelines and increased token complexity. In contrast, we propose to model patch-wise decoding with a neural field and present a single-scale, single-stage, efficient, end-to-end solution, which we call pixel neural field diffusion (PixNerd). Thanks to the efficient neural field representation in PixNerd, we achieve 2.15 FID on ImageNet 256×256 and 2.84 FID on ImageNet 512×512 without any complex cascade pipeline or VAE. We also extend the PixNerd framework to text-to-image applications. Our PixNerd-XXL/16 achieves a competitive overall score of 0.73 on the GenEval benchmark and 80.9 on the DPG benchmark.
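As a rough illustration of what patch-wise decoding with a neural field can look like, the sketch below treats each patch as a tiny coordinate-to-RGB function whose weights are supplied per patch. The abstract does not specify the architecture, so everything here is an assumption made for illustration: the function names (`coord_features`, `decode_patch`, `num_params`), the two-layer MLP, the sinusoidal coordinate encoding, and the idea that a flat parameter vector stands in for whatever the diffusion transformer would output for one patch token. It is not the authors' implementation.

```python
import math
import random


def coord_features(x, y, n_freqs=2):
    """Sinusoidal encoding of one pixel coordinate (x, y) in [0, 1]^2."""
    feats = [x, y]
    for k in range(n_freqs):
        f = (2 ** k) * math.pi
        feats += [math.sin(f * x), math.cos(f * x),
                  math.sin(f * y), math.cos(f * y)]
    return feats


def num_params(hidden=8, n_freqs=2):
    """Number of weights in the hypothetical per-patch MLP (no biases)."""
    in_dim = 2 + 4 * n_freqs
    return in_dim * hidden + hidden * 3


def decode_patch(params, patch_size, hidden=8, n_freqs=2):
    """Evaluate a per-patch neural field at every pixel of the patch.

    `params` is a flat list of floats. In this sketch it stands in for the
    weights a transformer would predict for one patch token (an assumption).
    Returns a patch_size x patch_size grid of [r, g, b] values.
    """
    in_dim = 2 + 4 * n_freqs
    n_w1 = in_dim * hidden
    assert len(params) == num_params(hidden, n_freqs)
    # Unpack the flat vector into two weight matrices.
    w1 = [params[i * hidden:(i + 1) * hidden] for i in range(in_dim)]
    w2 = [params[n_w1 + j * 3:n_w1 + (j + 1) * 3] for j in range(hidden)]

    patch = []
    for row in range(patch_size):
        out_row = []
        for col in range(patch_size):
            # Normalised pixel-centre coordinates inside the patch.
            feats = coord_features((col + 0.5) / patch_size,
                                   (row + 0.5) / patch_size, n_freqs)
            h = [math.tanh(sum(feats[i] * w1[i][j] for i in range(in_dim)))
                 for j in range(hidden)]
            rgb = [sum(h[j] * w2[j][c] for j in range(hidden))
                   for c in range(3)]
            out_row.append(rgb)
        patch.append(out_row)
    return patch


# Random weights stand in for a transformer's per-patch output.
rng = random.Random(0)
params = [rng.gauss(0.0, 0.1) for _ in range(num_params())]
patch = decode_patch(params, patch_size=16)
print(len(patch), len(patch[0]), len(patch[0][0]))  # 16 16 3
```

Because the field is a function of continuous coordinates, the same `params` can be sampled on a grid of a different size (for example `patch_size=8`) without changing the weights. Whether PixNerd uses this property is not stated in the abstract.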