
Training Consistency Models with Variational Noise Coupling

February 25, 2025
Authors: Gianluigi Silvestri, Luca Ambrogioni, Chieh-Hsin Lai, Yuhta Takida, Yuki Mitsufuji
cs.AI

Abstract

Consistency Training (CT) has recently emerged as a promising alternative to diffusion models, achieving competitive performance in image generation tasks. However, non-distillation consistency training often suffers from high variance and instability, and analyzing and improving its training dynamics is an active area of research. In this work, we propose a novel CT training approach based on the Flow Matching framework. Our main contribution is a trained noise-coupling scheme inspired by the architecture of Variational Autoencoders (VAE). By training a data-dependent noise emission model implemented as an encoder architecture, our method can indirectly learn the geometry of the noise-to-data mapping, which is instead fixed by the choice of the forward process in classical CT. Empirical results across diverse image datasets show significant generative improvements, with our model outperforming baselines and achieving the state-of-the-art (SoTA) non-distillation CT FID on CIFAR-10, and attaining FID on par with SoTA on ImageNet at 64×64 resolution in 2-step generation. Our code is available at https://github.com/sony/vct.
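The core idea in the abstract, an encoder that emits data-dependent noise which is then used to build the Flow Matching path for consistency training, can be illustrated with a minimal toy sketch. This is not the paper's implementation (see the linked repository for that); the `encoder`, the placeholder consistency model `f`, and the loss shape are all illustrative assumptions, written in NumPy to show only the data flow: reparameterized noise sampling, linear interpolation between noise and data, and a consistency penalty between adjacent times on the same path.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x):
    # Hypothetical data-dependent noise-emission model: maps data x to the
    # mean and log-variance of a Gaussian over the coupled noise. A fixed
    # linear map stands in for a trained network in this sketch.
    W = np.full((x.shape[-1], x.shape[-1]), 0.01)
    mu = x @ W
    logvar = np.zeros_like(mu)  # unit variance for the toy example
    return mu, logvar

def sample_coupled_noise(x, rng):
    # VAE-style reparameterization: z = mu + sigma * eps, so the noise paired
    # with each sample x is learned rather than drawn independently of x.
    mu, logvar = encoder(x)
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def flow_interpolant(x, z, t):
    # Flow Matching straight-line path: data x at t=0, noise z at t=1.
    return (1.0 - t) * x + t * z

def consistency_loss(f, x, rng, t=0.8, dt=0.1):
    # Toy consistency objective: the model f(x_t, t) should agree at two
    # adjacent times along the same noise-to-data trajectory.
    z = sample_coupled_noise(x, rng)
    x_t = flow_interpolant(x, z, t)
    x_s = flow_interpolant(x, z, t - dt)
    return float(np.mean((f(x_t, t) - f(x_s, t - dt)) ** 2))

f = lambda x_t, t: x_t * (1.0 - t)  # placeholder consistency model
x = rng.standard_normal((4, 8))
loss = consistency_loss(f, x, rng)
print(loss >= 0.0)
```

In the paper's actual method the encoder is trained jointly with the consistency model; the point of the sketch is only that the noise-to-data coupling becomes a learned, x-dependent distribution instead of an independent Gaussian draw.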

