

Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation

April 3, 2026
Authors: Xingtong Ge, Yi Zhang, Yushi Huang, Dailan He, Xiahong Wang, Bingqi Ma, Guanglu Song, Yu Liu, Jun Zhang
cs.AI

Abstract

Distilling video generation models to extremely low inference budgets (e.g., 2-4 NFEs) is crucial for real-time deployment, yet remains challenging. Trajectory-style consistency distillation often becomes conservative under complex video dynamics, yielding an over-smoothed appearance and weak motion. Distribution matching distillation (DMD) can recover sharp, mode-seeking samples, but its local training signals do not explicitly regularize how denoising updates compose across timesteps, making composed rollouts prone to drift. To overcome this limitation, we propose Self-Consistent Distribution Matching Distillation (SC-DMD), which explicitly regularizes the endpoint-consistent composition of consecutive denoising updates. For real-time autoregressive video generation, we further treat the KV cache as a quality-parameterized condition and propose Cache-Distribution-Aware training. This training scheme applies SC-DMD over multi-step rollouts and introduces a cache-conditioned feature alignment objective that steers low-quality outputs toward high-quality references. Extensive experiments on both non-autoregressive backbones (e.g., Wan 2.1) and autoregressive real-time paradigms (e.g., Self Forcing) show that our method, dubbed Salt, consistently improves low-NFE video generation quality while remaining compatible with diverse KV-cache memory mechanisms. Source code will be released at https://github.com/XingtongGe/Salt.
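To make the "endpoint-consistent composition" idea concrete, here is a minimal numerical sketch of the self-consistency regularizer described in the abstract. It assumes an x0-prediction parameterization and linear flow-style updates; the function names, the update rule, and the choice of a plain squared error are illustrative assumptions, not the paper's actual implementation. The penalty measures how far one direct update t0 → t2 disagrees with the composition of two consecutive updates t0 → t1 → t2.

```python
import numpy as np

def one_step_update(f, x, t, t_next):
    """Move x from time t to t_next along the endpoint predicted by the
    student f. (Hypothetical parameterization: f(x, t) predicts the clean
    sample x0; the update linearly interpolates toward that endpoint.)"""
    x0_hat = f(x, t)
    return x + (t - t_next) / t * (x0_hat - x)

def self_consistency_loss(f, x, t0, t1, t2):
    """Penalize disagreement between one direct update t0 -> t2 and the
    composition of two consecutive updates t0 -> t1 -> t2."""
    direct = one_step_update(f, x, t0, t2)
    composed = one_step_update(f, one_step_update(f, x, t0, t1), t1, t2)
    return np.mean((direct - composed) ** 2)

# Toy check: a student that always predicts the true endpoint composes
# exactly, so the self-consistency penalty vanishes.
x0_true = np.array([1.0, -2.0])
f_ideal = lambda x, t: x0_true + 0.0 * x   # ignores x; always returns x0_true
x_noisy = x0_true + np.array([0.5, 0.3])
loss = self_consistency_loss(f_ideal, x_noisy, 1.0, 0.6, 0.2)
```

An imperfect student would incur a positive penalty here, which is the kind of drift across composed rollouts the abstract says local DMD signals leave unconstrained.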
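The cache-conditioned feature alignment objective can likewise be sketched in toy form. Everything below is assumed for illustration: cache "quality" is modeled as coarse quantization of features, and the high-quality rollout is treated as a fixed (stop-gradient) target; the abstract does not specify how cache quality is parameterized or which features are aligned.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout_features(frames, kv_cache_quality):
    """Hypothetical stand-in for features produced when attending over a KV
    cache of the given quality ('high' = intact, 'low' = degraded).
    Degradation is modeled here as coarse quantization."""
    feats = frames.copy()
    if kv_cache_quality == "low":
        feats = np.round(feats * 4) / 4
    return feats

def cache_alignment_loss(frames):
    """Steer low-quality-cache features toward the high-quality reference;
    the reference is a fixed target (no gradient would flow through it)."""
    ref = rollout_features(frames, "high")
    low = rollout_features(frames, "low")
    return np.mean((low - ref) ** 2)

frames = rng.normal(size=(2, 4))
loss = cache_alignment_loss(frames)
```

The design intent this mirrors is that the same generator, conditioned on a degraded cache, is pulled toward what it would have produced with a pristine cache, keeping quality stable across diverse KV-cache memory mechanisms.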
April 7, 2026