
Normalizing Trajectory Models

May 8, 2026
Authors: Jiatao Gu, Tianrong Chen, Ying Shen, David Berthelot, Shuangfei Zhai, Josh Susskind
cs.AI

Abstract

Diffusion-based models decompose sampling into many small Gaussian denoising steps -- an assumption that breaks down when generation is compressed to a few coarse transitions. Existing few-step methods address this through distillation, consistency training, or adversarial objectives, but sacrifice the likelihood framework in the process. We introduce Normalizing Trajectory Models (NTM), which models each reverse step as an expressive conditional normalizing flow with exact likelihood training. Architecturally, NTM combines shallow invertible blocks within each step with a deep parallel predictor across the trajectory, forming an end-to-end network trainable from scratch or initializable from pretrained flow-matching models. Its exact trajectory likelihood further enables self-distillation: a lightweight denoiser trained on the model's own score produces high-quality samples in four steps. On text-to-image benchmarks, NTM matches or outperforms strong image generation baselines in just four sampling steps while uniquely retaining exact likelihood over the generative trajectory.
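To make the core idea concrete, here is a minimal sketch of modeling one reverse step as a conditional normalizing flow with an exact per-step log-likelihood via the change-of-variables formula. This is an illustrative toy (a single NumPy affine-coupling layer conditioned on the current state `x_t`), not the paper's architecture; all function names and dimensions are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 4          # data dimension (even, split in half for the coupling)
half = D // 2

def init_mlp(d_in, d_hidden, d_out):
    """Small conditioner network: [z_a, context] -> (log_scale, shift)."""
    return (0.1 * rng.standard_normal((d_in, d_hidden)), np.zeros(d_hidden),
            0.1 * rng.standard_normal((d_hidden, d_out)), np.zeros(d_out))

def mlp(params, x):
    W1, b1, W2, b2 = params
    return np.tanh(x @ W1 + b1) @ W2 + b2

params = init_mlp(half + D, 32, 2 * half)

def coupling_forward(z, context, params):
    """Map base sample z -> x_next, accumulating log|det J|."""
    z_a, z_b = z[:, :half], z[:, half:]
    h = mlp(params, np.concatenate([z_a, context], axis=1))
    log_s = np.tanh(h[:, :half])     # bounded log-scale for stability
    t = h[:, half:]
    x_b = z_b * np.exp(log_s) + t    # affine transform of the second half
    return np.concatenate([z_a, x_b], axis=1), log_s.sum(axis=1)

def coupling_inverse(x, context, params):
    """Invert the coupling exactly; returns z and the inverse log-det."""
    x_a, x_b = x[:, :half], x[:, half:]
    h = mlp(params, np.concatenate([x_a, context], axis=1))
    log_s = np.tanh(h[:, :half])
    t = h[:, half:]
    z_b = (x_b - t) * np.exp(-log_s)
    return np.concatenate([x_a, z_b], axis=1), -log_s.sum(axis=1)

def step_log_likelihood(x_next, x_t, params):
    """Exact log p(x_next | x_t) with a standard Gaussian base density."""
    z, log_det = coupling_inverse(x_next, x_t, params)
    log_base = -0.5 * (z ** 2).sum(axis=1) - 0.5 * D * np.log(2 * np.pi)
    return log_base + log_det

# Round-trip check: the inverse recovers z and the log-dets cancel.
z = rng.standard_normal((8, D))
x_t = rng.standard_normal((8, D))    # conditioning state of the trajectory
x_next, ld_f = coupling_forward(z, x_t, params)
z_rec, ld_i = coupling_inverse(x_next, x_t, params)
assert np.allclose(z, z_rec) and np.allclose(ld_f + ld_i, 0.0)
```

In the paper's setting, several such invertible blocks per step would be stacked and chained across the trajectory, so the exact trajectory likelihood is the sum of the per-step terms computed as in `step_log_likelihood`.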