
DiP: Taming Diffusion Models in Pixel Space

November 24, 2025
Authors: Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Xiaobin Hu, Hanzhen Zhao, Chengjie Wang, Jian Yang, Ying Tai
cs.AI

Abstract

Diffusion models face a fundamental trade-off between generation quality and computational efficiency. Latent Diffusion Models (LDMs) offer an efficient solution but suffer from potential information loss and non-end-to-end training. In contrast, existing pixel space models bypass VAEs but are computationally prohibitive for high-resolution synthesis. To resolve this dilemma, we propose DiP, an efficient pixel space diffusion framework. DiP decouples generation into a global and a local stage: a Diffusion Transformer (DiT) backbone operates on large patches for efficient global structure construction, while a co-trained lightweight Patch Detailer Head leverages contextual features to restore fine-grained local details. This synergistic design achieves computational efficiency comparable to LDMs without relying on a VAE. DiP delivers up to 10× faster inference than previous methods while increasing the total number of parameters by only 0.3%, and achieves a 1.79 FID score on ImageNet 256×256.
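
To make the two-stage design concrete, the snippet below is a minimal PyTorch-style sketch of the idea as stated in the abstract: a DiT-like backbone operates over large pixel patches to build global structure, and a lightweight per-patch detailer head maps the backbone's contextual features back to full-resolution pixels. The patch size, layer widths, module internals, and the omission of timestep/class conditioning are all illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of a pixel-space two-stage design (assumed details, not the paper's spec):
# a transformer backbone over large patches (global stage) plus a lightweight
# per-patch head that turns contextual token features back into pixel patches.
import torch
import torch.nn as nn


class ToyDiTBackbone(nn.Module):
    """Transformer over large pixel patches; timestep/class conditioning omitted for brevity."""
    def __init__(self, patch=16, dim=384, depth=4, heads=6):
        super().__init__()
        self.patch = patch
        self.embed = nn.Linear(3 * patch * patch, dim)  # patchify + linear projection
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, x):                               # x: (B, 3, H, W) noisy pixels
        B, C, H, W = x.shape
        p = self.patch
        tokens = x.unfold(2, p, p).unfold(3, p, p)      # (B, 3, H/p, W/p, p, p)
        tokens = tokens.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        return self.blocks(self.embed(tokens))          # (B, N, dim) contextual features


class PatchDetailerHead(nn.Module):
    """Lightweight head: maps each token's contextual feature to a full pixel patch."""
    def __init__(self, patch=16, dim=384):
        super().__init__()
        self.patch = patch
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                 nn.Linear(dim, 3 * patch * patch))

    def forward(self, feats, H, W):                     # feats: (B, N, dim)
        B, N, _ = feats.shape
        p = self.patch
        out = self.mlp(feats).view(B, H // p, W // p, 3, p, p)
        # Reassemble per-patch predictions into a full-resolution image/noise estimate.
        return out.permute(0, 3, 1, 4, 2, 5).reshape(B, 3, H, W)


backbone, head = ToyDiTBackbone(), PatchDetailerHead()
x_t = torch.randn(2, 3, 256, 256)                       # noisy images at some diffusion step
pred = head(backbone(x_t), 256, 256)                    # end-to-end pixel-space prediction
print(pred.shape)                                       # torch.Size([2, 3, 256, 256])
```

In this toy setup the head adds only a small MLP on top of the backbone, which is consistent with the abstract's claim that the detailer contributes roughly 0.3% of the total parameters, though the exact ratio here depends on the assumed sizes.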