Improved Training Technique for Latent Consistency Models
February 3, 2025
Authors: Quan Dao, Khanh Doan, Di Liu, Trung Le, Dimitris Metaxas
cs.AI
Abstract
Consistency models are a new family of generative models capable of producing
high-quality samples in either a single step or multiple steps. Recently,
consistency models have demonstrated impressive performance, achieving results
on par with diffusion models in the pixel space. However, the success of
scaling consistency training to large-scale datasets, particularly for
text-to-image and video generation tasks, is determined by performance in the
latent space. In this work, we analyze the statistical differences between
pixel and latent spaces, discovering that latent data often contains highly
impulsive outliers, which significantly degrade the performance of iCT in the
latent space. To address this, we replace Pseudo-Huber losses with Cauchy
losses, effectively mitigating the impact of outliers. Additionally, we
introduce a diffusion loss at early timesteps and employ optimal transport (OT)
coupling to further enhance performance. Lastly, we introduce the adaptive
scaling-c scheduler to manage the robust training process and adopt
Non-scaling LayerNorm in the architecture to better capture the statistics of
the features and reduce outlier impact. With these strategies, we successfully
train latent consistency models capable of high-quality sampling with one or
two steps, significantly narrowing the performance gap between latent
consistency and diffusion models. The implementation is released here:
https://github.com/quandao10/sLCT/
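The loss substitution at the heart of the abstract can be sketched concretely. The Pseudo-Huber form below follows the usual iCT-style definition; the Cauchy (Lorentzian) form is a standard robust-statistics loss and may differ in constants from the paper's exact formulation. The function names and the choice of `c` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def pseudo_huber(r, c=0.03):
    # Pseudo-Huber loss: quadratic near zero, linear for large residuals.
    # Its gradient magnitude saturates at 1, so large outliers still
    # contribute a constant-size gradient.
    return np.sqrt(r**2 + c**2) - c

def cauchy(r, c=0.03):
    # Cauchy (Lorentzian) robust loss: quadratic near zero but only
    # logarithmic for large residuals, so the gradient of an impulsive
    # outlier decays toward zero instead of saturating at a constant.
    return (c**2 / 2.0) * np.log1p((r / c)**2)
```

The practical difference shows up on outliers: for a residual of 10 with `c = 0.03`, the Pseudo-Huber loss is roughly 10 while the Cauchy loss is well under 0.01, which is the down-weighting of impulsive latent-space outliers that the abstract describes.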
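"Non-scaling LayerNorm" most plausibly means standard layer normalization with the learnable affine scale removed, so the layer only normalizes and cannot amplify outlier channels. The sketch below is that reading, written as a hypothetical NumPy function; it is not taken from the paper's released code, and the name and `eps` default are assumptions.

```python
import numpy as np

def non_scaling_layernorm(x, eps=1e-6):
    # Normalize each feature vector to zero mean and unit variance
    # over the last axis, with no learnable scale (gamma) or bias (beta).
    # Keeping the statistics fixed limits how much a few outlier
    # activations can dominate downstream layers.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)
```

In PyTorch terms this corresponds roughly to `nn.LayerNorm(dim, elementwise_affine=False)`, i.e. normalization without the trainable scale.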