
Improving Diffusion-Based Image Synthesis with Context Prediction

January 4, 2024
Authors: Ling Yang, Jingwei Liu, Shenda Hong, Zhilong Zhang, Zhilin Huang, Zheming Cai, Wentao Zhang, Bin Cui
cs.AI

Abstract

Diffusion models are a new class of generative models and have dramatically advanced image generation, achieving unprecedented quality and diversity. Existing diffusion models mainly try to reconstruct the input image from a corrupted one with a pixel-wise or feature-wise constraint along spatial axes. However, such point-based reconstruction may fail to make each predicted pixel/feature fully preserve its neighborhood context, impairing diffusion-based image synthesis. As a powerful source of automatic supervisory signal, context has been well studied for learning representations. Inspired by this, we for the first time propose ConPreDiff to improve diffusion-based image synthesis with context prediction. We explicitly reinforce each point to predict its neighborhood context (i.e., multi-stride features/tokens/pixels) with a context decoder at the end of the diffusion denoising blocks in the training stage, and remove the decoder for inference. In this way, each point can better reconstruct itself by preserving its semantic connections with its neighborhood context. This new ConPreDiff paradigm can generalize to arbitrary discrete and continuous diffusion backbones without introducing extra parameters in the sampling procedure. Extensive experiments are conducted on unconditional image generation, text-to-image generation, and image inpainting tasks. Our ConPreDiff consistently outperforms previous methods and achieves a new SOTA text-to-image generation result on MS-COCO, with a zero-shot FID score of 6.21.
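To make the training-time mechanism concrete, here is a minimal PyTorch-style sketch of the idea described in the abstract: a small context decoder attached after a denoising block predicts each position's neighborhood features at several strides, contributes an auxiliary loss during training, and is skipped entirely at inference. The module structure and the simple MSE-on-shifted-features objective are illustrative assumptions for this sketch, not the authors' actual implementation.

```python
# Sketch (assumed structure, not the ConPreDiff reference code) of a
# training-only context-prediction head attached to a denoising block.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextPredictionHead(nn.Module):
    """Predicts axis-aligned neighborhood features at multiple strides."""

    def __init__(self, channels: int, strides=(1, 2, 4)):
        super().__init__()
        self.strides = strides
        # One lightweight 1x1 conv per stride; it outputs predictions for
        # 4 neighbors (up, down, left, right) of each spatial position.
        self.heads = nn.ModuleList(
            [nn.Conv2d(channels, 4 * channels, kernel_size=1) for _ in strides]
        )

    def loss(self, feats: torch.Tensor) -> torch.Tensor:
        """Auxiliary loss: predictions are compared to shifted feature maps."""
        total = feats.new_zeros(())
        for stride, head in zip(self.strides, self.heads):
            preds = head(feats).chunk(4, dim=1)  # up, down, left, right
            targets = (
                torch.roll(feats, -stride, dims=2),  # neighbor `stride` rows up
                torch.roll(feats, stride, dims=2),   # down
                torch.roll(feats, -stride, dims=3),  # left
                torch.roll(feats, stride, dims=3),   # right
            )
            for p, t in zip(preds, targets):
                total = total + F.mse_loss(p, t.detach())
        return total / (4 * len(self.strides))


class DenoisingBlockWithContext(nn.Module):
    """Wraps a denoising block; the context head is used only in training."""

    def __init__(self, block: nn.Module, channels: int):
        super().__init__()
        self.block = block
        self.context_head = ContextPredictionHead(channels)

    def forward(self, x: torch.Tensor):
        feats = self.block(x)
        ctx_loss = self.context_head.loss(feats) if self.training else None
        return feats, ctx_loss


if __name__ == "__main__":
    block = nn.Conv2d(64, 64, kernel_size=3, padding=1)  # stand-in denoising block
    model = DenoisingBlockWithContext(block, channels=64)
    x = torch.randn(2, 64, 32, 32)
    feats, ctx_loss = model(x)   # training mode: auxiliary context loss returned
    model.eval()
    feats, _ = model(x)          # inference: context head skipped, no extra cost
```

Because the context head only produces an auxiliary training loss, dropping it at inference leaves the sampling procedure and its parameter count unchanged, which is what lets the idea plug into arbitrary diffusion backbones.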