Terminal Velocity Matching
November 24, 2025
Authors: Linqi Zhou, Mathias Parger, Ayaan Haque, Jiaming Song
cs.AI
Abstract
We propose Terminal Velocity Matching (TVM), a generalization of flow matching that enables high-fidelity one- and few-step generative modeling. TVM models the transition between any two diffusion timesteps and regularizes its behavior at the terminal time rather than at the initial time. We prove that TVM upper-bounds the 2-Wasserstein distance between the data and model distributions when the model is Lipschitz continuous. However, since Diffusion Transformers lack this property, we introduce minimal architectural changes that achieve stable, single-stage training. To make TVM efficient in practice, we develop a fused attention kernel that supports backward passes on Jacobian-vector products and scales well with transformer architectures. On ImageNet 256×256, TVM achieves 3.29 FID with a single function evaluation (NFE) and 1.99 FID with 4 NFEs; on ImageNet 512×512 it achieves 4.32 FID at 1 NFE and 2.94 FID at 4 NFEs, state-of-the-art among one- and few-step models trained from scratch.
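
The abstract does not state the training objective explicitly, but its relation to flow matching can be sketched. Standard flow matching regresses a one-time velocity field onto the interpolant's velocity:

  \mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}\big[\,\| v_\theta(x_t, t) - (x_1 - x_0) \|_2^2\,\big], \qquad x_t = (1 - t)\,x_0 + t\,x_1.

On the abstract's description, TVM instead learns a two-time transition map F_\theta(x_t, t, s) \approx x_s and constrains its velocity at the terminal time s rather than at t. Schematically (an illustrative guess at the shape of the loss, not the paper's exact formulation):

  \mathcal{L}_{\mathrm{TVM}}(\theta) \approx \mathbb{E}\big[\,\| \partial_s F_\theta(x_t, t, s) - v(x_s, s) \|_2^2\,\big],

which would collapse back to flow matching as s \to t. The terminal derivative \partial_s F_\theta is a Jacobian-vector product through the network, which is presumably why training needs efficient JVP support.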
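
Backpropagating through such an objective means taking a backward pass through a JVP of the attention layers. Below is a minimal JAX sketch of that computation using plain jax.jvp composed with jax.grad; the fused kernel itself is not shown in the abstract, so the shapes, names, and toy loss here are illustrative assumptions, not the authors' implementation.

import jax
import jax.numpy as jnp

def attention(q, k, v):
    # Reference scaled dot-product attention (single head, unbatched).
    scale = q.shape[-1] ** -0.5
    weights = jax.nn.softmax((q @ k.T) * scale, axis=-1)
    return weights @ v

def jvp_loss(q, k, v, tq, tk, tv):
    # Forward-mode derivative of attention along the tangent (tq, tk, tv):
    # the kind of quantity a terminal-velocity objective would penalize.
    _, tangent_out = jax.jvp(attention, (q, k, v), (tq, tk, tv))
    return jnp.sum(tangent_out ** 2)  # toy loss on the JVP output

key = jax.random.PRNGKey(0)
q, k, v, tq, tk, tv = (jax.random.normal(k_i, (16, 64))
                       for k_i in jax.random.split(key, 6))

# "Backward pass on a Jacobian-vector product": reverse-mode through forward-mode.
grads = jax.grad(jvp_loss, argnums=(0, 1, 2))(q, k, v, tq, tk, tv)
print([g.shape for g in grads])

A fused kernel would compute attention and its JVP in a single pass and expose a custom derivative rule for the combined op; the naive composition above is the correctness baseline such a kernel can be checked against.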