Improving Diffusion-Based Image Synthesis with Context Prediction

Abstract

Diffusion models are a new class of generative models that have dramatically advanced image generation, delivering unprecedented quality and diversity. Existing diffusion models mainly try to reconstruct the input image from a corrupted one with a pixel-wise or feature-wise constraint along the spatial axes. However, such point-based reconstruction may fail to make each predicted pixel/feature fully preserve its neighborhood context, impairing diffusion-based image synthesis. As a powerful source of automatic supervisory signal, context has been well studied for learning representations. Inspired by this, we propose, for the first time, ConPreDiff to improve diffusion-based image synthesis with context prediction. In the training stage, we explicitly reinforce each point to predict its neighborhood context (i.e., multi-stride features/tokens/pixels) with a context decoder at the end of the diffusion denoising blocks, and we remove the decoder for inference. In this way, each point can better reconstruct itself by preserving its semantic connections with its neighborhood context. This new paradigm of ConPreDiff can generalize to arbitrary discrete and continuous diffusion backbones without introducing extra parameters in the sampling procedure. Extensive experiments are conducted on unconditional image generation, text-to-image generation, and image inpainting tasks. ConPreDiff consistently outperforms previous methods and achieves new SOTA text-to-image generation results on MS-COCO, with a zero-shot FID score of 6.21.
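
The training/inference asymmetry described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class and function names (`ToyDenoiser`, `ContextDecoder`, `training_loss`, `sample_step`), the single-channel grid, the 8-neighbor offsets at strides 1 and 2, and the use of mean squared error for the context term are all assumptions made for illustration. The only idea taken from the abstract is that a context decoder predicts each point's multi-stride neighborhood during training and is dropped at sampling time.

```python
import random

def neighborhood_offsets(strides=(1, 2)):
    # Assumed neighborhood: the 8 surrounding points at each stride.
    return [(dy, dx) for s in strides
            for dy in (-s, 0, s) for dx in (-s, 0, s)
            if (dy, dx) != (0, 0)]

def gather_context(grid, offsets):
    # For each point of an H x W grid, collect its neighbors (zero outside the grid).
    H, W = len(grid), len(grid[0])
    def at(i, j):
        return grid[i][j] if 0 <= i < H and 0 <= j < W else 0.0
    return [[[at(i + dy, j + dx) for dy, dx in offsets]
             for j in range(W)] for i in range(H)]

def flatten(nested):
    # Yield the scalar leaves of an arbitrarily nested list.
    for item in nested:
        if isinstance(item, list):
            yield from flatten(item)
        else:
            yield item

def mse(a, b):
    pairs = list(zip(flatten(a), flatten(b)))
    return sum((p - q) ** 2 for p, q in pairs) / len(pairs)

class ToyDenoiser:
    # Hypothetical stand-in for a diffusion backbone: one scale and bias per point.
    def __init__(self, scale=1.0, bias=0.0):
        self.scale, self.bias = scale, bias
    def __call__(self, noisy):
        return [[self.scale * v + self.bias for v in row] for row in noisy]

class ContextDecoder:
    # Training-only head: maps each point prediction to K neighbor predictions.
    def __init__(self, num_neighbors, rng):
        self.weights = [rng.uniform(-0.1, 0.1) for _ in range(num_neighbors)]
    def __call__(self, point_pred):
        return [[[w * v for w in self.weights] for v in row] for row in point_pred]

def training_loss(denoiser, decoder, clean, noisy, offsets, context_weight=1.0):
    # Point-wise reconstruction term plus the added context-prediction term.
    point_pred = denoiser(noisy)
    point_loss = mse(point_pred, clean)
    context_loss = mse(decoder(point_pred), gather_context(clean, offsets))
    return point_loss + context_weight * context_loss

def sample_step(denoiser, noisy):
    # Inference uses the denoiser alone; the context decoder is not involved.
    return denoiser(noisy)

rng = random.Random(0)
offsets = neighborhood_offsets()
clean = [[rng.random() for _ in range(6)] for _ in range(6)]
noisy = [[v + rng.gauss(0, 0.1) for v in row] for row in clean]
denoiser = ToyDenoiser()
decoder = ContextDecoder(len(offsets), rng)
loss = training_loss(denoiser, decoder, clean, noisy, offsets)
restored = sample_step(denoiser, noisy)
```

In this sketch the decoder's weights appear only inside `training_loss`, so `sample_step` has the same parameters as the plain denoiser, mirroring the abstract's statement that no extra parameters are introduced in the sampling procedure.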
