Diffusion Models Beat GANs on Image Classification
July 17, 2023
Authors: Soumik Mukhopadhyay, Matthew Gwilliam, Vatsal Agarwal, Namitha Padmanabhan, Archana Swaminathan, Srinidhi Hegde, Tianyi Zhou, Abhinav Shrivastava
cs.AI
Abstract
While many unsupervised learning models focus on one family of tasks, either generative or discriminative, we explore the possibility of a unified representation learner: a model which uses a single pre-training stage to address both families of tasks simultaneously. We identify diffusion models as a prime candidate. Diffusion models have risen to prominence as a state-of-the-art method for image generation, denoising, inpainting, super-resolution, manipulation, etc. Such models involve training a U-Net to iteratively predict and remove noise, and the resulting model can synthesize high-fidelity, diverse, novel images. The U-Net architecture, as a convolution-based architecture, generates a diverse set of feature representations in the form of intermediate feature maps. We present our findings that these embeddings are useful beyond the noise prediction task, as they contain discriminative information and can also be leveraged for classification. We explore optimal methods for extracting and using these embeddings for classification tasks, demonstrating promising results on the ImageNet classification task. We find that with careful feature selection and pooling, diffusion models outperform comparable generative-discriminative methods such as BigBiGAN for classification tasks. We investigate diffusion models in the transfer learning regime, examining their performance on several fine-grained visual classification datasets. We compare these embeddings to those generated by competing architectures and pre-trainings for classification tasks.
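The pipeline the abstract describes (noise an image to a chosen timestep, run one U-Net forward pass, read off an intermediate feature map, pool it, and fit a linear probe on the frozen embedding) can be sketched in a few lines of PyTorch. The sketch below is a minimal illustration, not the paper's exact setup: it uses a small, randomly initialized diffusers UNet2DModel as a stand-in for a pretrained diffusion backbone, and the timestep (t=90), block choice (mid block), and pooling size are illustrative assumptions rather than the paper's tuned values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from diffusers import UNet2DModel

# Randomly initialized U-Net as a stand-in for a pretrained
# diffusion backbone; a real experiment would load trained weights.
unet = UNet2DModel(sample_size=64, in_channels=3, out_channels=3)
unet.eval()

# Capture an intermediate feature map with a forward hook.
features = {}
def hook(module, inputs, output):
    features["mid"] = output

# Which block to tap is a hyperparameter; the mid block is an
# illustrative choice here, not the paper's selected layer.
unet.mid_block.register_forward_hook(hook)

def extract_embedding(x, t=90, pool_size=1):
    """Noise the image at timestep t, run the U-Net once, and pool
    the hooked feature map into a fixed-size embedding."""
    noise = torch.randn_like(x)
    # Crude stand-in for the forward process q(x_t | x_0); a real
    # run would use the noise schedule the model was trained with.
    alpha = 1.0 - t / 1000.0
    x_t = alpha**0.5 * x + (1 - alpha)**0.5 * noise
    with torch.no_grad():
        unet(x_t, t)
    fmap = features["mid"]                 # (B, C, H, W)
    pooled = F.adaptive_avg_pool2d(fmap, pool_size)
    return pooled.flatten(1)               # (B, C * pool_size**2)

# Linear probe trained on the frozen embeddings.
x = torch.randn(4, 3, 64, 64)              # batch of images
emb = extract_embedding(x)
probe = nn.Linear(emb.shape[1], 1000)      # e.g., ImageNet classes
logits = probe(emb)
print(logits.shape)                        # torch.Size([4, 1000])
```

The three knobs in this sketch, which block to hook, which timestep to noise to, and how aggressively to pool, correspond to the "careful feature selection and pooling" the abstract credits for the reported classification gains.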