Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation

Abstract

Distilling video generation models to extremely low inference budgets (e.g., 2-4 NFEs) is crucial for real-time deployment, yet remains challenging. Trajectory-style consistency distillation often becomes conservative under complex video dynamics, yielding an over-smoothed appearance and weak motion. Distribution matching distillation (DMD) can recover sharp, mode-seeking samples, but its local training signals do not explicitly regularize how denoising updates compose across timesteps, making composed rollouts prone to drift. To address this, we propose Self-Consistent Distribution Matching Distillation (SC-DMD), which explicitly regularizes the endpoint-consistent composition of consecutive denoising updates. For real-time autoregressive video generation, we further treat the KV cache as a quality-parameterized condition and propose Cache-Distribution-Aware training. This training scheme applies SC-DMD over multi-step rollouts and introduces a cache-conditioned feature alignment objective that steers low-quality outputs toward high-quality references. Across extensive experiments on both non-autoregressive backbones (e.g., Wan 2.1) and autoregressive real-time paradigms (e.g., Self Forcing), our method, dubbed Salt, consistently improves low-NFE video generation quality while remaining compatible with diverse KV-cache memory mechanisms. Source code will be released at https://github.com/XingtongGe/Salt.
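As a rough illustration of what "endpoint-consistent composition" of denoising updates could mean, the toy sketch below measures the gap between a two-hop rollout and a direct jump to the same endpoint. It is not the SC-DMD objective, which the abstract does not spell out. The function names, the `update(x, t_from, t_to)` signature, and both toy updates are hypothetical and chosen only for illustration.

```python
from typing import Callable, List

Vector = List[float]
# Hypothetical signature: maps a sample at noise level t_from to level t_to.
Update = Callable[[Vector, float, float], Vector]


def composition_gap(update: Update, x: Vector, t: float, mid: float, end: float) -> float:
    """Mean squared gap between a two-hop rollout (t -> mid -> end)
    and a direct jump (t -> end) from the same starting sample."""
    two_hop = update(update(x, t, mid), mid, end)
    direct = update(x, t, end)
    return sum((a - b) ** 2 for a, b in zip(two_hop, direct)) / len(x)


def consistent_update(x: Vector, t_from: float, t_to: float) -> Vector:
    # Toy update: scale by the ratio of noise levels.
    # Ratios multiply, so consecutive hops compose exactly.
    return [v * (t_to / t_from) for v in x]


def drifting_update(x: Vector, t_from: float, t_to: float) -> Vector:
    # Toy update with a constant bias added on every call,
    # so a multi-hop rollout accumulates error relative to a direct jump.
    return [v * (t_to / t_from) + 0.1 for v in x]


if __name__ == "__main__":
    x0 = [1.0, -2.0]
    print(composition_gap(consistent_update, x0, 1.0, 0.5, 0.25))
    print(composition_gap(drifting_update, x0, 1.0, 0.5, 0.25))
```

In this toy setting the first update has zero gap while the biased one does not, which mirrors the abstract's point that updates that look fine locally can still drift once they are composed over a rollout.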
