The Diffusion Duality, Chapter II: Ψ-Samplers and Efficient Curriculum

Abstract

Uniform-state discrete diffusion models excel at few-step generation and guidance because they can self-correct, which makes them preferable to autoregressive or masked diffusion models in these settings. However, their sample quality plateaus with ancestral samplers as the number of steps increases. We introduce a family of Predictor-Corrector (PC) samplers for discrete diffusion that generalize prior methods and apply to arbitrary noise processes. When paired with uniform-state diffusion, our samplers outperform ancestral sampling on both language and image modeling, achieving lower generative perplexity at matched unigram entropy on OpenWebText and better FID/IS scores on CIFAR10. Crucially, unlike conventional samplers, our PC methods continue to improve with more sampling steps. Taken together, these findings call into question the assumption that masked diffusion is the inevitable future of diffusion-based language modeling. Beyond sampling, we develop a memory-efficient curriculum for the Gaussian relaxation training phase, reducing training time by 25% and memory usage by 33% compared to Duo while maintaining comparable perplexity on OpenWebText and LM1B and strong downstream performance. We release code, checkpoints, and a video tutorial at: https://s-sahoo.com/duo-ch2
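The comparison "at matched unigram entropy" is easier to read with the metric written out. Generative perplexity on its own can favor repetitive, low-diversity samples, so it is usually reported alongside a diversity measure. The sketch below computes the Shannon entropy of the empirical token distribution of a single sample. The function name `unigram_entropy`, the use of nats, and the per-sample formulation are assumptions made for illustration; the abstract does not specify the exact evaluation protocol.

```python
import math
from collections import Counter


def unigram_entropy(tokens):
    """Shannon entropy (in nats) of the empirical token distribution.

    Illustrative helper, not taken from the released code: `tokens` is
    assumed to be a sequence of token ids from one generated sample.
    """
    counts = Counter(tokens)
    total = len(tokens)
    # H = sum_v p(v) * log(1 / p(v)), with p(v) = count(v) / total.
    return sum((c / total) * math.log(total / c) for c in counts.values())


# Toy sequences (made-up token ids, for illustration only).
diverse = [3, 7, 1, 9, 4, 2, 8, 5]      # eight distinct tokens
repetitive = [3, 3, 3, 3, 3, 3, 3, 7]   # dominated by a single token

print(unigram_entropy(diverse))
print(unigram_entropy(repetitive))
```

In this toy case the diverse sequence reaches the maximum entropy for eight tokens (log 8), while the repetitive one scores much lower. Two samplers are comparable on generative perplexity only when their samples have similar entropy under a measure of this kind.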
