
PixelDiT: Pixel Diffusion Transformers for Image Generation

November 25, 2025
Authors: Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, Jiebo Luo
cs.AI

Abstract

Latent-space modeling has been the standard paradigm for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline in which a pretrained autoencoder introduces lossy reconstruction, causing error accumulation and hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the autoencoder and learns the diffusion process directly in pixel space. PixelDiT adopts a fully transformer-based architecture with a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine detail. Our analysis reveals that effective pixel-level token modeling is essential to the success of pixel diffusion. PixelDiT achieves an FID of 1.61 on ImageNet 256×256, surpassing existing pixel-space generative models by a large margin. We further extend PixelDiT to text-to-image generation and pretrain it at 1024×1024 resolution in pixel space, where it achieves 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models.
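The dual-level design described above operates on two token granularities: coarse patch tokens for global semantics and per-patch pixel tokens for texture. The paper does not publish its tokenization code, so the following is only a minimal NumPy sketch of how an image might be split into these two token sets; the patch size, the mean-pooled patch embedding, and the function name `dual_level_tokens` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def dual_level_tokens(img, patch=16):
    """Split an (H, W, C) image into pixel-level tokens grouped by patch
    (for a texture-refining DiT) and patch-level tokens (for a global DiT).
    Patch size and the pooled embedding are illustrative assumptions."""
    H, W, C = img.shape
    gh, gw = H // patch, W // patch  # patch grid, e.g. 16x16 for 256/16
    # Pixel-level tokens: one token per pixel, grouped by the patch it lies in.
    pixels = img.reshape(gh, patch, gw, patch, C).transpose(0, 2, 1, 3, 4)
    pixel_tokens = pixels.reshape(gh * gw, patch * patch, C)
    # Patch-level tokens: here a cheap mean-pool over each patch's pixels;
    # a real model would use a learned linear patch embedding instead.
    patch_tokens = pixel_tokens.mean(axis=1)
    return patch_tokens, pixel_tokens

img = np.random.rand(256, 256, 3)
patch_tok, pix_tok = dual_level_tokens(img)
print(patch_tok.shape)  # (256, 3): 16x16 grid of patch tokens
print(pix_tok.shape)    # (256, 256, 3): 256 pixel tokens per patch
```

For a 256×256 input with 16×16 patches this yields 256 patch tokens for the global transformer and 256 pixel tokens within each patch for local refinement, which is the shape split the dual-level design relies on.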