Normalizing Trajectory Models

Abstract

Diffusion-based models decompose sampling into many small Gaussian denoising steps -- an assumption that breaks down when generation is compressed to a few coarse transitions. Existing few-step methods address this through distillation, consistency training, or adversarial objectives, but sacrifice the likelihood framework in the process. We introduce Normalizing Trajectory Models (NTM), which models each reverse step as an expressive conditional normalizing flow with exact likelihood training. Architecturally, NTM combines shallow invertible blocks within each step with a deep parallel predictor across the trajectory, forming an end-to-end network trainable from scratch or initializable from pretrained flow-matching models. Its exact trajectory likelihood further enables self-distillation: a lightweight denoiser trained on the model's own score produces high-quality samples in four steps. On text-to-image benchmarks, NTM matches or outperforms strong image generation baselines in just four sampling steps while uniquely retaining exact likelihood over the generative trajectory.
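The likelihood bookkeeping described above can be illustrated with a minimal sketch. It is not the NTM architecture: the conditioner `affine_params`, the weight matrices, and the function names are all hypothetical, and a single affine layer reduces each step to a Gaussian transition, which is exactly the restriction the abstract argues against. An expressive model would stack non-linear invertible blocks in its place. The sketch shows only how a per-step change-of-variables log-density sums to an exact log-likelihood over a four-step trajectory.

```python
import numpy as np

def affine_params(x_t, w_mu, w_log_s):
    """Hypothetical conditioner: maps the noisier state x_t to a shift and a log-scale."""
    mu = x_t @ w_mu
    log_s = np.tanh(x_t @ w_log_s)  # bounded log-scale keeps exp() well behaved
    return mu, log_s

def step_log_prob(x_prev, x_t, w_mu, w_log_s):
    """Exact log p(x_prev | x_t) for one conditional affine flow step.

    The flow maps x_prev to z = (x_prev - mu) * exp(-log_s) with z ~ N(0, I).
    Change of variables: log p(x_prev | x_t) = log N(z; 0, I) - sum(log_s).
    """
    mu, log_s = affine_params(x_t, w_mu, w_log_s)
    z = (x_prev - mu) * np.exp(-log_s)
    d = x_prev.shape[-1]
    log_base = -0.5 * np.sum(z ** 2, axis=-1) - 0.5 * d * np.log(2.0 * np.pi)
    return log_base - np.sum(log_s, axis=-1)

def step_sample(x_t, w_mu, w_log_s, rng):
    """Draw x_prev by pushing base noise through the inverse of the flow."""
    mu, log_s = affine_params(x_t, w_mu, w_log_s)
    z = rng.standard_normal(x_t.shape)
    return mu + np.exp(log_s) * z

def sample_trajectory(x_T, params, rng):
    """Return [x_T, ..., x_0], applying one flow step per entry of params."""
    traj = [x_T]
    for w_mu, w_log_s in params:
        traj.append(step_sample(traj[-1], w_mu, w_log_s, rng))
    return traj

def trajectory_log_prob(traj, params):
    """Sum of per-step conditional log-densities: log p(x_{T-1}, ..., x_0 | x_T)."""
    return sum(
        step_log_prob(traj[i + 1], traj[i], w_mu, w_log_s)
        for i, (w_mu, w_log_s) in enumerate(params)
    )

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, steps = 3, 4  # four steps, mirroring the sampling budget in the abstract
    params = [
        (0.1 * rng.standard_normal((dim, dim)), 0.1 * rng.standard_normal((dim, dim)))
        for _ in range(steps)
    ]
    traj = sample_trajectory(rng.standard_normal((2, dim)), params, rng)
    print(trajectory_log_prob(traj, params))  # one log-likelihood per batch element
```

Because every step is invertible with a tractable Jacobian, the trajectory log-likelihood is a plain sum of per-step terms. Under the same assumption, that sum could serve directly as a maximum-likelihood training objective.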