Normalizing Trajectory Models
May 8, 2026
Authors: Jiatao Gu, Tianrong Chen, Ying Shen, David Berthelot, Shuangfei Zhai, Josh Susskind
cs.AI
Abstract
Diffusion-based models decompose sampling into many small Gaussian denoising steps -- an assumption that breaks down when generation is compressed to a few coarse transitions. Existing few-step methods address this through distillation, consistency training, or adversarial objectives, but sacrifice the likelihood framework in the process. We introduce Normalizing Trajectory Models (NTM), which models each reverse step as an expressive conditional normalizing flow with exact likelihood training. Architecturally, NTM combines shallow invertible blocks within each step with a deep parallel predictor across the trajectory, forming an end-to-end network trainable from scratch or initializable from pretrained flow-matching models. Its exact trajectory likelihood further enables self-distillation: a lightweight denoiser trained on the model's own score produces high-quality samples in four steps. On text-to-image benchmarks, NTM matches or outperforms strong image generation baselines in just four sampling steps while uniquely retaining exact likelihood over the generative trajectory.
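The core claim -- that modeling a reverse step as a conditional normalizing flow yields an exact likelihood -- follows from the change-of-variables formula. The sketch below is illustrative only (not the authors' architecture): a single conditional affine-coupling layer, with names like `affine_coupling` and `cond` chosen for exposition, showing how a per-step log-likelihood is computed exactly from the base density plus the Jacobian log-determinant.

```python
import numpy as np

# Hypothetical sketch, not NTM itself: one reverse step modeled as a shallow
# conditional affine-coupling flow, whose log-likelihood is exact via the
# change-of-variables formula. All names and shapes are illustrative.

def affine_coupling(x, cond, w1, w2):
    """Split x in half; transform the second half conditioned on the first
    half and on a step-conditioning vector. Returns (output, log|det J|)."""
    x1, x2 = np.split(x, 2)
    h = np.tanh(w1 @ np.concatenate([x1, cond]))  # shallow conditioner net
    log_scale, shift = np.split(w2 @ h, 2)
    y2 = x2 * np.exp(log_scale) + shift           # elementwise affine map
    logdet = log_scale.sum()                      # exact Jacobian log-det
    return np.concatenate([x1, y2]), logdet

def step_log_likelihood(x, cond, w1, w2):
    """Exact log p(x | cond) under a standard-normal base density."""
    z, logdet = affine_coupling(x, cond, w1, w2)
    base = -0.5 * (z @ z + len(z) * np.log(2 * np.pi))
    return base + logdet

rng = np.random.default_rng(0)
d, c, hdim = 4, 3, 8
w1 = 0.1 * rng.standard_normal((hdim, d // 2 + c))
w2 = 0.1 * rng.standard_normal((d, hdim))
x = rng.standard_normal(d)
cond = rng.standard_normal(c)
print(step_log_likelihood(x, cond, w1, w2))
```

Because each step's likelihood is exact, a trajectory likelihood is just the sum of such per-step terms; this is what the abstract credits with enabling self-distillation on the model's own score.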