

Transition Matching Distillation for Fast Video Generation

January 14, 2026
Authors: Weili Nie, Julius Berner, Nanye Ma, Chao Liu, Saining Xie, Arash Vahdat
cs.AI

Abstract

Large video diffusion and flow models have achieved remarkable success in high-quality video generation, but their use in real-time interactive applications remains limited due to their inefficient multi-step sampling process. In this work, we present Transition Matching Distillation (TMD), a novel framework for distilling video diffusion models into efficient few-step generators. The central idea of TMD is to match the multi-step denoising trajectory of a diffusion model with a few-step probability transition process, where each transition is modeled as a lightweight conditional flow. To enable efficient distillation, we decompose the original diffusion backbone into two components: (1) a main backbone, comprising the majority of early layers, that extracts semantic representations at each outer transition step; and (2) a flow head, consisting of the last few layers, that leverages these representations to perform multiple inner flow updates. Given a pretrained video diffusion model, we first introduce a flow head to the model, and adapt it into a conditional flow map. We then apply distribution matching distillation to the student model with flow head rollout in each transition step. Extensive experiments on distilling Wan2.1 1.3B and 14B text-to-video models demonstrate that TMD provides a flexible and strong trade-off between generation speed and visual quality. In particular, TMD outperforms existing distilled models under comparable inference costs in terms of visual fidelity and prompt adherence. Project page: https://research.nvidia.com/labs/genair/tmd
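The two-level sampling scheme the abstract describes (a few expensive outer transition steps, each realized by several cheap inner flow updates from the flow head) can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: `main_backbone`, `flow_head`, `tmd_sample`, and all shapes and schedules here are hypothetical stand-ins for the large transformer components in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def main_backbone(x, t):
    """Toy stand-in: extract a 'semantic representation' from noisy sample x
    at outer noise level t. In TMD this is the bulk of the diffusion backbone."""
    return np.tanh(x * (1.0 - t))

def flow_head(x, h, s):
    """Toy stand-in: predict a velocity for the inner conditional flow at
    inner time s, conditioned on the cached backbone features h."""
    return h - x  # simple contraction toward the features

def tmd_sample(shape, n_outer=4, n_inner=4):
    """Few-step sampling: each outer transition step makes ONE backbone call,
    then integrates the inner flow with several cheap Euler updates."""
    x = rng.standard_normal(shape)              # start from Gaussian noise
    for k in range(n_outer):
        t = 1.0 - k / n_outer                   # outer noise level, 1 -> 0
        h = main_backbone(x, t)                 # expensive: once per transition
        ds = 1.0 / n_inner
        for j in range(n_inner):                # cheap: flow-head rollout
            s = j * ds
            x = x + ds * flow_head(x, h, s)     # Euler step of the inner flow
    return x

sample = tmd_sample((8,))
print(sample.shape)
```

The design point this illustrates is the cost split: with `n_outer = 4` and `n_inner = 4`, the heavy backbone runs only 4 times while the lightweight head runs 16 times, which is how TMD trades generation speed against visual quality.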