
PixelDiT: Pixel Diffusion Transformers for Image Generation

PixelDiT: Pixel Diffusion Transformers for Image Generation

November 25, 2025
作者: Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, Jiebo Luo
cs.AI

Abstract

Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in the pixel space. PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine details. Our analysis reveals that effective pixel-level token modeling is essential to the success of pixel diffusion. PixelDiT achieves 1.61 FID on ImageNet 256x256, surpassing existing pixel generative models by a large margin. We further extend PixelDiT to text-to-image generation and pretrain it at the 1024x1024 resolution in pixel space. It achieves 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models.
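The abstract describes a dual-level design: patch-level tokens carry global semantics while pixel-level tokens within each patch refine texture. The sketch below illustrates only the tokenization side of that idea with hypothetical helper names (`patchify`, `dual_level_tokens` are not from the paper); the actual PixelDiT transformer layers and diffusion training loop are not shown.

```python
import numpy as np

def patchify(img, p):
    """Split an image of shape (H, W, C) into non-overlapping p x p patches.
    Returns an array of shape (num_patches, p*p*C), one flat token per patch."""
    H, W, C = img.shape
    gh, gw = H // p, W // p
    x = img[:gh * p, :gw * p].reshape(gh, p, gw, p, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, p * p * C)

def dual_level_tokens(img, p):
    """Build the two token streams assumed by the dual-level design:
    - patch tokens: one token per p x p patch (global semantics)
    - pixel tokens: one token per pixel inside each patch (texture detail)
    Function and variable names here are illustrative, not from the paper."""
    patch_tokens = patchify(img, p)                     # (N, p*p*C)
    N = patch_tokens.shape[0]
    C = img.shape[2]
    pixel_tokens = patch_tokens.reshape(N, p * p, C)    # (N, p*p, C)
    return patch_tokens, pixel_tokens

# Example at the paper's ImageNet resolution, with an assumed patch size of 16.
img = np.zeros((256, 256, 3), dtype=np.float32)
patch_tok, pixel_tok = dual_level_tokens(img, p=16)
# 256/16 = 16 patches per side -> 256 patch tokens of dim 16*16*3 = 768,
# each carrying 256 per-pixel tokens of dim 3.
```

In a latent DiT, the pixel-level stream would be replaced by an autoencoder's latents; operating on both streams directly in pixel space is what lets the model train end to end without lossy reconstruction.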