Back to Basics: Let Denoising Generative Models Denoise

November 17, 2025
Authors: Tianhong Li, Kaiming He
cs.AI

Abstract

Today's denoising diffusion models do not "denoise" in the classical sense, i.e., they do not directly predict clean images. Rather, the neural networks predict noise or a noised quantity. In this paper, we suggest that predicting clean data and predicting noised quantities are fundamentally different. According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. We show that simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss. Our approach is conceptually nothing more than "Just image Transformers", or JiT, as we call it. We report competitive results using JiT with large patch sizes of 16 and 32 on ImageNet at resolutions of 256 and 512, where predicting high-dimensional noised quantities can fail catastrophically. With our networks mapping back to the basics of the manifold, our research goes back to basics and pursues a self-contained paradigm for Transformer-based diffusion on raw natural data.
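To make the recipe concrete, below is a minimal PyTorch sketch, not the authors' implementation, of what the abstract describes: a plain Vision Transformer on raw 16×16 pixel patches, trained only to regress the clean image. The names (`TinyJiT`, `x_prediction_loss`), the model sizes, and the linear interpolation z_t = (1 − t)·x + t·ε are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of "predict clean data" diffusion with a plain ViT on
# pixels. Assumptions (not from the paper): linear flow interpolation
# z_t = (1 - t) * x + t * eps, toy model sizes, simplistic t-conditioning.
import torch
import torch.nn as nn

class TinyJiT(nn.Module):
    """'Just an image Transformer': patchify raw pixels, run a plain
    Transformer, project back to pixels. No tokenizer, no pre-training,
    no extra losses."""
    def __init__(self, img_size=256, patch=16, dim=384, depth=6, heads=6):
        super().__init__()
        self.patch = patch
        n = (img_size // patch) ** 2
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n, dim))
        self.t_embed = nn.Linear(1, dim)  # toy noise-level conditioning
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, 3 * patch * patch)  # back to pixel space

    def forward(self, z_t, t):
        b, _, h, w = z_t.shape
        tokens = self.embed(z_t).flatten(2).transpose(1, 2) + self.pos
        tokens = tokens + self.t_embed(t.view(b, 1, 1))
        tokens = self.blocks(tokens)
        patches = self.head(tokens)                     # (b, n, 3*p*p)
        p = self.patch
        patches = patches.view(b, h // p, w // p, 3, p, p)
        return patches.permute(0, 3, 1, 4, 2, 5).reshape(b, 3, h, w)

def x_prediction_loss(model, x):
    """Train the network to output the clean image x directly,
    rather than the noise eps or a velocity."""
    b = x.shape[0]
    t = torch.rand(b, device=x.device).clamp(min=1e-3)  # noise level in (0, 1]
    eps = torch.randn_like(x)
    t_ = t.view(b, 1, 1, 1)
    z_t = (1 - t_) * x + t_ * eps   # noisy input: off the data manifold
    x_hat = model(z_t, t)           # prediction targets the low-dim manifold
    return ((x_hat - x) ** 2).mean()

# Toy usage:
#   model = TinyJiT()
#   loss = x_prediction_loss(model, torch.randn(2, 3, 256, 256))
#   loss.backward()
```

Under this assumed convention the flow velocity is v = ε − x, so the clean-image prediction still plugs into standard ODE samplers via v̂ = (z_t − x̂)/t; only the network's regression target changes, not the sampling machinery.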