Training Consistency Models with Variational Noise Coupling

Abstract

Consistency Training (CT) has recently emerged as a promising alternative to diffusion models, achieving competitive performance in image generation tasks. However, non-distillation consistency training often suffers from high variance and instability, and analyzing and improving its training dynamics is an active area of research. In this work, we propose a novel CT approach based on the Flow Matching framework. Our main contribution is a trained noise-coupling scheme inspired by the architecture of Variational Autoencoders (VAEs). By training a data-dependent noise emission model implemented as an encoder architecture, our method can indirectly learn the geometry of the noise-to-data mapping, which in classical CT is instead fixed by the choice of the forward process. Empirical results across diverse image datasets show significant generative improvements: our model outperforms baselines, achieves the state-of-the-art (SoTA) non-distillation CT FID on CIFAR-10, and attains FID on par with SoTA on ImageNet at 64×64 resolution in 2-step generation. Our code is available at https://github.com/sony/vct.
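
As a rough illustration of the idea of a data-dependent noise emission model, the following NumPy sketch samples noise from a Gaussian whose parameters depend on the data point, using the VAE-style reparameterization trick, and then forms a point on a linear path between data and noise. It is not the authors' implementation (see the linked repository for that). The linear "encoder" weights `w_mu` and `w_logvar`, the function names, and the path convention (t = 0 is data, t = 1 is noise) are all assumptions made for this example.

```python
import numpy as np

def encode_noise(x, w_mu, w_logvar, rng):
    """Hypothetical linear encoder: maps data x to a Gaussian over noise, then samples."""
    mu = x @ w_mu                      # data-dependent mean
    logvar = x @ w_logvar              # data-dependent log-variance
    eps = rng.standard_normal(x.shape)
    z = mu + np.exp(0.5 * logvar) * eps  # reparameterization trick
    return z, mu, logvar

def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), one value per sample."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def interpolate(x, z, t):
    """Assumed linear path: x_t = (1 - t) * x + t * z, with t of shape (batch,)."""
    t = t[:, None]
    return (1.0 - t) * x + t * z

rng = np.random.default_rng(0)
batch, dim = 4, 8
x = rng.standard_normal((batch, dim))

# With zero weights the encoder ignores x, so z is standard Gaussian noise
# drawn independently of the data (the fixed, non-learned coupling).
w_mu = np.zeros((dim, dim))
w_logvar = np.zeros((dim, dim))

z, mu, logvar = encode_noise(x, w_mu, w_logvar, rng)
kl = kl_to_standard_normal(mu, logvar)   # zero for the independent coupling
t = rng.uniform(size=batch)
x_t = interpolate(x, z, t)               # noisy points a consistency model would see
```

In this sketch, the KL term plays the role it has in a VAE: it keeps the learned noise distribution close to a standard Gaussian, so that sampling from plain Gaussian noise at generation time remains reasonable. How the actual method weights this term and combines it with the consistency loss is not specified in the abstract.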