Normalizing Trajectory Models

Abstract

Diffusion-based models decompose sampling into many small Gaussian denoising steps, an assumption that breaks down when generation is compressed to a few coarse transitions. Existing few-step methods address this through distillation, consistency training, or adversarial objectives, but sacrifice the likelihood framework in the process. We introduce Normalizing Trajectory Models (NTM), which model each reverse step as an expressive conditional normalizing flow trained with exact likelihood. Architecturally, NTM combines shallow invertible blocks within each step with a deep parallel predictor across the trajectory, forming an end-to-end network that can be trained from scratch or initialized from pretrained flow-matching models. Its exact trajectory likelihood further enables self-distillation: a lightweight denoiser trained on the model's own score produces high-quality samples in four steps. On text-to-image benchmarks, NTM matches or outperforms strong image generation baselines in just four sampling steps while uniquely retaining exact likelihood over the generative trajectory.
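To make the idea of an exact trajectory likelihood concrete, the following is a minimal sketch, not the NTM architecture itself. It assumes a toy 2-D state, a single affine coupling layer per reverse step, and a hand-written function `_cond_params` standing in for a learned network; all function names here are hypothetical. Each reverse step maps standard normal noise to x_{t-1} conditioned on x_t, so the step log-density follows from the change-of-variables formula, and the trajectory log-likelihood is the prior term plus the sum of the per-step terms.

```python
import math
import random


def _cond_params(cond, extra=0.0):
    """Toy stand-in for a learned network (an assumption of this sketch):
    maps conditioning values to a (shift, log_scale) pair."""
    h = math.tanh(0.5 * cond + extra)
    return 0.8 * h, 0.3 * h


def step_forward(z, x_t):
    """Sampling direction: map base noise z to x_{t-1}, conditioned on x_t."""
    m1, s1 = _cond_params(x_t[0])
    a = z[0] * math.exp(s1) + m1
    # The second coordinate is also conditioned on the first output, so the
    # resulting transition is not simply a diagonal Gaussian.
    m2, s2 = _cond_params(x_t[1], extra=a)
    b = z[1] * math.exp(s2) + m2
    return (a, b)


def step_inverse(x_prev, x_t):
    """Likelihood direction: map x_{t-1} back to z.
    Returns (z, log|det dz/dx_{t-1}|)."""
    m1, s1 = _cond_params(x_t[0])
    z1 = (x_prev[0] - m1) * math.exp(-s1)
    m2, s2 = _cond_params(x_t[1], extra=x_prev[0])
    z2 = (x_prev[1] - m2) * math.exp(-s2)
    # The Jacobian is triangular, so its log-determinant is a sum of log-scales.
    return (z1, z2), -(s1 + s2)


def std_normal_logpdf(z):
    """Log-density of a standard normal with identity covariance."""
    return sum(-0.5 * (v * v + math.log(2.0 * math.pi)) for v in z)


def step_log_prob(x_prev, x_t):
    """Exact log p(x_{t-1} | x_t) via change of variables."""
    z, logdet = step_inverse(x_prev, x_t)
    return std_normal_logpdf(z) + logdet


def trajectory_log_prob(traj):
    """traj = [x_T, ..., x_0]; assumes the prior x_T ~ N(0, I)."""
    total = std_normal_logpdf(traj[0])
    for x_t, x_prev in zip(traj, traj[1:]):
        total += step_log_prob(x_prev, x_t)
    return total


def sample_trajectory(num_steps=4, seed=0):
    """Draw x_T from the prior, then apply num_steps reverse steps."""
    rng = random.Random(seed)
    x = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
    traj = [x]
    for _ in range(num_steps):
        z = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        x = step_forward(z, x)
        traj.append(x)
    return traj


if __name__ == "__main__":
    traj = sample_trajectory(num_steps=4)
    print("log p(trajectory) =", trajectory_log_prob(traj))
```

In this sketch, training by maximum likelihood would amount to maximizing `trajectory_log_prob` over the parameters hidden inside `_cond_params`; a practical model would replace that function with a neural network and stack several invertible blocks per step.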