Diffusion Models Beat GANs on Image Classification
July 17, 2023
Authors: Soumik Mukhopadhyay, Matthew Gwilliam, Vatsal Agarwal, Namitha Padmanabhan, Archana Swaminathan, Srinidhi Hegde, Tianyi Zhou, Abhinav Shrivastava
cs.AI
Abstract
While many unsupervised learning models focus on one family of tasks, either
generative or discriminative, we explore the possibility of a unified
representation learner: a model which uses a single pre-training stage to
address both families of tasks simultaneously. We identify diffusion models as
a prime candidate. Diffusion models have risen to prominence as a
state-of-the-art method for image generation, denoising, inpainting,
super-resolution, manipulation, etc. Such models involve training a U-Net to
iteratively predict and remove noise, and the resulting model can synthesize
high-fidelity, diverse, novel images. The U-Net, being a
convolution-based architecture, generates a diverse set of feature
representations in the form of intermediate feature maps. We present our
findings that these embeddings are useful beyond the noise prediction task, as
they contain discriminative information and can also be leveraged for
classification. We explore optimal methods for extracting and using these
embeddings for classification tasks, demonstrating promising results on the
ImageNet classification task. We find that with careful feature selection and
pooling, diffusion models outperform comparable generative-discriminative
methods such as BigBiGAN for classification tasks. We investigate diffusion
models in the transfer learning regime, examining their performance on several
fine-grained visual classification datasets. We compare these embeddings to
those generated by competing architectures and pre-trainings for classification
tasks.
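The feature-extraction recipe the abstract describes — selecting intermediate U-Net feature maps and pooling them into a single embedding for a classifier — can be sketched at the shape level as follows. This is a minimal illustration, not the paper's implementation: the block names, channel counts, and spatial resolutions below are hypothetical stand-ins for whatever intermediate activations a diffusion U-Net would produce.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical intermediate U-Net feature maps for one image, as
# (channels, height, width) arrays taken from two decoder blocks.
# Names and shapes are illustrative, not the actual U-Net dimensions.
feature_maps = {
    "block_24": rng.standard_normal((512, 16, 16)),
    "block_30": rng.standard_normal((256, 32, 32)),
}

def pool_and_concat(maps, block_names):
    """Global-average-pool each selected feature map over its spatial
    dimensions, then concatenate the per-block channel vectors into
    one embedding suitable for a linear classification head."""
    pooled = [maps[name].mean(axis=(1, 2)) for name in block_names]
    return np.concatenate(pooled)

embedding = pool_and_concat(feature_maps, ["block_24", "block_30"])
print(embedding.shape)  # (768,): 512 + 256 pooled channels
```

In practice the choice of which blocks (and which diffusion timestep) to extract from, and how to pool, is exactly the "careful feature selection and pooling" the abstract says drives the classification results; the resulting vector would then feed a linear probe or small classifier head.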