
Back to Basics: Let Denoising Generative Models Denoise

November 17, 2025
Authors: Tianhong Li, Kaiming He
cs.AI

Abstract

Today's denoising diffusion models do not "denoise" in the classical sense, i.e., they do not directly predict clean images. Rather, the neural networks predict noise or a noised quantity. In this paper, we suggest that predicting clean data and predicting noised quantities are fundamentally different. According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. We show that simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss. Our approach is conceptually nothing more than "Just image Transformers", or JiT, as we call it. We report competitive results using JiT with large patch sizes of 16 and 32 on ImageNet at resolutions of 256 and 512, where predicting high-dimensional noised quantities can fail catastrophically. With our networks mapping back to the basics of the manifold, our research goes back to basics and pursues a self-contained paradigm for Transformer-based diffusion on raw natural data.
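
To make the central distinction concrete, below is a minimal PyTorch sketch (our illustration, not the paper's released code) of a denoising training step under a simple linear noising schedule x_t = (1 - t)·x + t·ε. The network `net`, the schedule, and the unweighted MSE loss are assumptions for exposition only.

```python
import torch

def denoising_loss(net, x, predict_clean=True):
    """One training step on a batch of clean images x of shape (B, C, H, W).

    net: any image-to-image network, e.g. a plain pixel-space ViT with
         large patches (16 or 32, as JiT advocates); assumed signature
         net(x_t, t) -> tensor shaped like x.
    """
    b = x.shape[0]
    t = torch.rand(b, device=x.device).view(b, 1, 1, 1)  # time in [0, 1]
    eps = torch.randn_like(x)                            # Gaussian noise
    x_t = (1.0 - t) * x + t * eps                        # noised input

    out = net(x_t, t.flatten())
    if predict_clean:
        # x-prediction: the target lies on the low-dimensional data manifold.
        target = x
    else:
        # eps-prediction: the target is full-dimensional noise, which the
        # paper argues can fail catastrophically with large pixel patches.
        target = eps
    return ((out - target) ** 2).mean()
```

Note that given x_t and t, the two targets are algebraically interchangeable (ε̂ = (x_t − (1 − t)·x̂) / t), so the paper's argument concerns what the network is asked to regress, not what the sampler can compute from its output.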