

DiP: Taming Diffusion Models in Pixel Space

November 24, 2025
Authors: Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Xiaobin Hu, Hanzhen Zhao, Chengjie Wang, Jian Yang, Ying Tai
cs.AI

Abstract

Diffusion models face a fundamental trade-off between generation quality and computational efficiency. Latent Diffusion Models (LDMs) offer an efficient solution but suffer from potential information loss and preclude end-to-end training. In contrast, existing pixel-space models bypass the VAE but are computationally prohibitive for high-resolution synthesis. To resolve this dilemma, we propose DiP, an efficient pixel-space diffusion framework. DiP decouples generation into a global and a local stage: a Diffusion Transformer (DiT) backbone operates on large patches for efficient global structure construction, while a co-trained, lightweight Patch Detailer Head leverages contextual features to restore fine-grained local details. This synergistic design achieves computational efficiency comparable to LDMs without relying on a VAE. DiP delivers up to 10× faster inference than previous methods while increasing the total parameter count by only 0.3%, and achieves a 1.79 FID score on ImageNet 256×256.
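
The following is a minimal, illustrative PyTorch sketch of the two-stage idea the abstract describes: a transformer backbone operating on large pixel-space patches for global structure, followed by a lightweight per-patch head that maps contextual features back to full-resolution pixels. The patch size, hidden width, the plain `nn.TransformerEncoder` standing in for a DiT, the MLP form of the detail head, and the omission of timestep/class conditioning are all assumptions made for illustration, not the paper's actual implementation.

```python
# Minimal sketch of the DiP-style two-stage design (assumptions noted above;
# not the authors' implementation).
import torch
import torch.nn as nn


class PatchDetailerHead(nn.Module):
    """Lightweight head mapping each patch's contextual feature back to
    full-resolution pixels for that patch (illustrative MLP form)."""

    def __init__(self, hidden_dim: int, patch_size: int, channels: int = 3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.LayerNorm(hidden_dim),
            nn.Linear(hidden_dim, channels * patch_size * patch_size),
        )
        self.patch_size = patch_size
        self.channels = channels

    def forward(self, tokens: torch.Tensor, grid: int) -> torch.Tensor:
        # tokens: (B, N, hidden_dim) with N = grid * grid patch tokens
        b = tokens.shape[0]
        p, c = self.patch_size, self.channels
        x = self.mlp(tokens)                        # (B, N, c*p*p)
        x = x.view(b, grid, grid, c, p, p)
        x = x.permute(0, 3, 1, 4, 2, 5)             # (B, c, grid, p, grid, p)
        return x.reshape(b, c, grid * p, grid * p)  # full-resolution output


class DiPSketch(nn.Module):
    """Global stage (transformer backbone over large pixel patches, standing
    in for a DiT) followed by the local detail head; trained end to end on
    pixels, with no VAE in the loop."""

    def __init__(self, image_size=256, patch_size=16, hidden_dim=768,
                 depth=12, heads=12, channels=3):
        super().__init__()
        self.grid = image_size // patch_size
        self.embed = nn.Linear(channels * patch_size * patch_size, hidden_dim)
        self.pos = nn.Parameter(torch.zeros(1, self.grid ** 2, hidden_dim))
        layer = nn.TransformerEncoderLayer(hidden_dim, heads,
                                           dim_feedforward=4 * hidden_dim,
                                           batch_first=True, norm_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)
        self.detailer = PatchDetailerHead(hidden_dim, patch_size, channels)
        self.patch_size = patch_size

    def forward(self, noisy_image: torch.Tensor) -> torch.Tensor:
        b, c, h, w = noisy_image.shape
        p, g = self.patch_size, self.grid
        # Patchify directly in pixel space (no VAE encoding).
        x = noisy_image.view(b, c, g, p, g, p).permute(0, 2, 4, 1, 3, 5)
        x = x.reshape(b, g * g, c * p * p)
        tokens = self.backbone(self.embed(x) + self.pos)  # global structure
        return self.detailer(tokens, g)                   # local detail recovery


# Usage: one denoising-style forward pass on a random 256x256 batch.
pred = DiPSketch()(torch.randn(2, 3, 256, 256))  # -> (2, 3, 256, 256)
```

In this sketch the detail head adds only a single LayerNorm + Linear layer per patch on top of the backbone, which is one plausible way a co-trained head could stay at a fraction of a percent of the total parameter count, as the abstract reports for DiP.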