Improving Diffusion-Based Image Synthesis with Context Prediction
January 4, 2024
Authors: Ling Yang, Jingwei Liu, Shenda Hong, Zhilong Zhang, Zhilin Huang, Zheming Cai, Wentao Zhang, Bin Cui
cs.AI
Abstract
Diffusion models are a new class of generative models that have dramatically advanced image generation with unprecedented quality and diversity. Existing diffusion models mainly try to reconstruct the input image from a corrupted one under a pixel-wise or feature-wise constraint along the spatial axes. However, such point-based reconstruction may fail to make each predicted pixel/feature fully preserve its neighborhood context, impairing diffusion-based image synthesis. As a powerful source of automatic supervisory signals, context has been well studied for representation learning. Inspired by this, we propose ConPreDiff, the first approach to improve diffusion-based image synthesis with context prediction. In the training stage, we explicitly reinforce each point to predict its neighborhood context (i.e., multi-stride features/tokens/pixels) with a context decoder placed at the end of the diffusion denoising blocks, and we remove the decoder for inference. In this way, each point can better reconstruct itself by preserving its semantic connections with the neighborhood context. This new paradigm of ConPreDiff generalizes to arbitrary discrete and continuous diffusion backbones without introducing extra parameters into the sampling procedure. Extensive experiments are conducted on unconditional image generation, text-to-image generation, and image inpainting tasks. Our ConPreDiff consistently outperforms previous methods and achieves new SOTA text-to-image generation results on MS-COCO, with a zero-shot FID score of 6.21.
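
The sketch below illustrates the general idea described in the abstract, not the authors' actual implementation: an auxiliary context-decoder head is attached to the denoiser's last-block features during training to predict each point's shifted (multi-stride) neighborhood, its loss is added to the standard denoising loss, and the head is discarded at inference so sampling costs are unchanged. The `ContextDecoder`, `neighborhood_targets`, and `training_step` names, the use of axis-aligned shifts as "neighborhood context", the assumption that the backbone exposes its last-block features, and the loss weighting are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the authors' code): train a denoiser with an
# auxiliary context-prediction head, then drop the head at inference time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextDecoder(nn.Module):
    """Hypothetical head: predicts neighborhood features at several strides."""
    def __init__(self, channels, strides=(1, 2, 3)):
        super().__init__()
        self.strides = strides
        # One 1x1 conv per stride; each predicts 4 axis-aligned neighbors at once.
        self.heads = nn.ModuleList(
            nn.Conv2d(channels, channels * 4, kernel_size=1) for _ in strides
        )

    def forward(self, feat):
        # Returns one predicted neighborhood map per stride.
        return [head(feat) for head in self.heads]

def neighborhood_targets(feat, stride):
    # Ground-truth neighbors: the feature map shifted by `stride` along each axis.
    shifts = [(0, stride), (0, -stride), (stride, 0), (-stride, 0)]
    return torch.cat([torch.roll(feat, s, dims=(2, 3)) for s in shifts], dim=1)

def training_step(denoiser, ctx_decoder, x0, t, noise):
    x_t = x0 + noise                    # placeholder for the real forward (noising) process
    eps_pred, feat = denoiser(x_t, t)   # assumes the backbone also returns last-block features
    loss_denoise = F.mse_loss(eps_pred, noise)
    # Auxiliary context-prediction loss over all strides.
    loss_ctx = sum(
        F.mse_loss(pred, neighborhood_targets(feat, s))
        for pred, s in zip(ctx_decoder(feat), ctx_decoder.strides)
    )
    return loss_denoise + 0.1 * loss_ctx  # the 0.1 weight is an arbitrary illustrative choice

# At inference, only `denoiser` is used; `ctx_decoder` adds no sampling-time parameters.
```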