
Direct Discriminative Optimization: Your Likelihood-Based Visual Generative Model is Secretly a GAN Discriminator

March 3, 2025
作者: Kaiwen Zheng, Yongxin Chen, Huayu Chen, Guande He, Ming-Yu Liu, Jun Zhu, Qinsheng Zhang
cs.AI

Abstract

While likelihood-based generative models, particularly diffusion and autoregressive models, have achieved remarkable fidelity in visual generation, the maximum likelihood estimation (MLE) objective inherently suffers from a mode-covering tendency that limits the generation quality under limited model capacity. In this work, we propose Direct Discriminative Optimization (DDO) as a unified framework that bridges likelihood-based generative training and the GAN objective to bypass this fundamental constraint. Our key insight is to parameterize a discriminator implicitly using the likelihood ratio between a learnable target model and a fixed reference model, drawing parallels with the philosophy of Direct Preference Optimization (DPO). Unlike GANs, this parameterization eliminates the need for joint training of generator and discriminator networks, allowing for direct, efficient, and effective finetuning of a well-trained model to its full potential beyond the limits of MLE. DDO can be performed iteratively in a self-play manner for progressive model refinement, with each round requiring less than 1% of pretraining epochs. Our experiments demonstrate the effectiveness of DDO by significantly advancing the previous SOTA diffusion model EDM, reducing FID scores from 1.79/1.58 to new records of 1.30/0.97 on CIFAR-10/ImageNet-64 datasets, and by consistently improving both guidance-free and CFG-enhanced FIDs of visual autoregressive models on ImageNet 256×256.
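To make the core idea concrete, here is a minimal sketch of the DPO-style objective the abstract describes: the discriminator is parameterized implicitly as the sigmoid of the log-likelihood ratio between the learnable target model and the fixed reference model, and trained with a GAN-style logistic loss on real versus model samples. The function name `ddo_loss` and the scalar log-likelihood inputs are illustrative assumptions for this sketch, not the paper's actual implementation (which operates on diffusion/autoregressive likelihoods).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ddo_loss(logp_tgt_real, logp_ref_real, logp_tgt_fake, logp_ref_fake):
    """GAN-style logistic loss with an implicit discriminator
    d(x) = sigmoid(log p_theta(x) - log p_ref(x)).

    `*_real` are log-likelihoods on data samples; `*_fake` are
    log-likelihoods on samples drawn from the fixed reference model.
    Only the target model p_theta is updated; p_ref stays frozen,
    so no separate discriminator network is trained.
    """
    # Push the target to assign higher likelihood than the
    # reference on real data...
    real_term = np.log(sigmoid(logp_tgt_real - logp_ref_real))
    # ...and lower likelihood than the reference on fakes.
    fake_term = np.log(1.0 - sigmoid(logp_tgt_fake - logp_ref_fake))
    return -(np.mean(real_term) + np.mean(fake_term))

# A target that raises likelihood on real data and lowers it on
# fakes (relative to the reference) achieves a lower loss than a
# target identical to the reference.
good = ddo_loss(np.array([2.0]), np.array([0.0]),
                np.array([-2.0]), np.array([0.0]))
neutral = ddo_loss(np.array([0.0]), np.array([0.0]),
                   np.array([0.0]), np.array([0.0]))
```

Note how this mirrors DPO: because the discriminator is just a likelihood ratio, minimizing this loss directly finetunes the generative model itself, rather than training an adversary jointly as in a GAN.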
