Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency

Abstract

This work represents the first effort to scale up continuous-time consistency distillation to general application-level image and video diffusion models. Although the continuous-time consistency model (sCM) is theoretically principled and empirically powerful for accelerating academic-scale diffusion models, its applicability to large-scale text-to-image and video tasks remains unclear, owing to infrastructure challenges in Jacobian-vector product (JVP) computation and the limitations of standard evaluation benchmarks. We first develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and on high-dimensional video tasks. Our investigation reveals fundamental quality limitations of sCM in fine-detail generation, which we attribute to error accumulation and the "mode-covering" nature of its forward-divergence objective. To remedy this, we propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer. This integration complements sCM with the "mode-seeking" reverse divergence, effectively improving visual quality while maintaining high generation diversity. Validated on large-scale models (Cosmos-Predict2, Wan2.1) with up to 14B parameters and on 5-second videos, rCM matches or surpasses the state-of-the-art distillation method DMD2 on quality metrics while offering notable advantages in diversity, all without GAN tuning or extensive hyperparameter searches. The distilled models generate high-fidelity samples in only 1 to 4 steps, accelerating diffusion sampling by 15× to 50×. These results position rCM as a practical and theoretically grounded framework for advancing large-scale diffusion distillation.
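For readers unfamiliar with the Jacobian-vector product mentioned above, the following is a minimal sketch of forward-mode differentiation using dual numbers in pure Python. It only illustrates what a JVP computes: the derivative of a model output along a chosen direction, here the time derivative of f(x(t), t) along a trajectory with velocity dx/dt. The names `Dual`, `jvp`, and `toy_model` are hypothetical and introduced purely for illustration; this is not the FlashAttention-2 JVP kernel described in the abstract, and the toy function is not any network from the paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A value paired with its tangent (directional derivative)."""
    val: float
    tan: float

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.tan + other.tan)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)

    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants (zero tangent)."""
    return x if isinstance(x, Dual) else Dual(float(x), 0.0)


def sin(d):
    return Dual(math.sin(d.val), math.cos(d.val) * d.tan)


def cos(d):
    return Dual(math.cos(d.val), -math.sin(d.val) * d.tan)


def jvp(f, primals, tangents):
    """Return (f(primals), J_f(primals) @ tangents) in one forward pass."""
    duals = [Dual(p, v) for p, v in zip(primals, tangents)]
    out = f(*duals)
    return out.val, out.tan


def toy_model(x, t):
    # Hypothetical scalar stand-in for a time-conditioned model f(x, t).
    return cos(t) * x + sin(t) * x * x


# Time derivative of f(x(t), t) along a trajectory with dx/dt = v:
# the tangent is (v, 1), i.e. "move x by v while t advances by 1".
x, t, v = 0.5, 0.3, -1.2
value, total_derivative = jvp(toy_model, (x, t), (v, 1.0))
```

The point of the sketch is that the derivative is obtained alongside the forward pass, without building the full Jacobian. At the scale discussed in the abstract, the same idea has to be supported inside the attention kernel and the parallelism scheme, which is the infrastructure difficulty the authors refer to.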
