Unified Latents (UL): How to train your latents
February 19, 2026
Authors: Jonathan Heek, Emiel Hoogeboom, Thomas Mensink, Tim Salimans
cs.AI
Abstract
We present Unified Latents (UL), a framework for learning latent representations that are jointly regularized by a diffusion prior and decoded by a diffusion model. By linking the encoder's output noise to the prior's minimum noise level, we obtain a simple training objective that provides a tight upper bound on the latent bitrate. On ImageNet-512, our approach achieves a competitive FID of 1.4 and high reconstruction quality (PSNR) while requiring fewer training FLOPs than models trained on Stable Diffusion latents. On Kinetics-600, we set a new state-of-the-art FVD of 1.3.
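One way to picture the bitrate bound the abstract refers to is the standard variational rate term: if the encoder emits a latent with Gaussian noise at the prior's minimum noise level, the KL divergence between that noisy posterior and the prior upper-bounds the bits needed to transmit the latent. The sketch below is illustrative only and simplifies heavily: it stands in a unit Gaussian for the learned diffusion prior, and the function name and the `sigma_min` value are assumptions, not part of the paper.

```python
import math

def latent_rate_upper_bound(mu, sigma_min=0.05):
    """KL( N(mu, sigma_min^2 I) || N(0, I) ) in nats.

    Toy upper bound on the latent bitrate, assuming the diffusion
    prior at its minimum noise level is approximated by a standard
    Gaussian (a simplification; UL uses a learned diffusion prior).
    """
    d = len(mu)
    sq_norm = sum(m * m for m in mu)
    # Closed-form Gaussian KL: 0.5 * (||mu||^2 + d*sigma^2 - d - d*log(sigma^2))
    return 0.5 * (sq_norm + d * sigma_min**2 - d - 2 * d * math.log(sigma_min))
```

With `mu = 0` and `sigma_min = 1` the bound is zero (posterior equals prior, nothing to transmit); shrinking `sigma_min` or moving `mu` away from the prior mean increases the rate.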