Streaming Autoregressive Video Generation via Diagonal Distillation

Abstract

Large pretrained diffusion models have significantly improved the quality of generated videos, yet their use in real-time streaming remains limited. Autoregressive models offer a natural framework for sequential frame synthesis but require heavy computation to achieve high fidelity. Diffusion distillation can compress these models into efficient few-step variants, but existing video distillation approaches largely adapt image-specific methods that neglect temporal dependencies. These techniques often excel in image generation but underperform in video synthesis, exhibiting reduced motion coherence, error accumulation over long sequences, and a latency-quality trade-off. We identify two factors behind these limitations: insufficient use of temporal context during step reduction, and implicit prediction of subsequent noise levels in next-chunk prediction (i.e., exposure bias). To address these issues, we propose Diagonal Distillation, which operates orthogonally to existing approaches and better exploits temporal information across both video chunks and denoising steps. Central to our approach is an asymmetric generation strategy: more denoising steps for early chunks and fewer for later ones. This design allows later chunks to inherit rich appearance information from thoroughly processed early chunks, while using partially denoised chunks as conditional inputs for subsequent synthesis. By aligning the implicit prediction of subsequent noise levels during chunk generation with the actual inference conditions, our approach mitigates error propagation and reduces oversaturation in long-range sequences. We further incorporate implicit optical flow modeling to preserve motion quality under strict step constraints. Our method generates a 5-second video in 2.61 seconds (up to 31 FPS), achieving a 277.3x speedup over the undistilled model.
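As a rough illustration of the asymmetric strategy described in the abstract, the following sketch assigns a per-chunk denoising step budget that shrinks for later chunks. The function name `diagonal_step_schedule`, the linear one-step-per-chunk decay, and the example values (5 chunks, 4 steps down to 2) are assumptions made for illustration only; the abstract does not specify the actual schedule.

```python
from typing import List


def diagonal_step_schedule(num_chunks: int, max_steps: int, min_steps: int) -> List[int]:
    """Hypothetical schedule: chunk k receives max_steps - k denoising steps,
    never fewer than min_steps. The linear decay is an assumption, not the
    schedule used in the paper."""
    if num_chunks < 1 or min_steps < 1 or max_steps < min_steps:
        raise ValueError("need num_chunks >= 1 and max_steps >= min_steps >= 1")
    return [max(max_steps - k, min_steps) for k in range(num_chunks)]


# Illustrative values only: early chunks get more steps, later chunks fewer.
schedule = diagonal_step_schedule(num_chunks=5, max_steps=4, min_steps=2)
uniform_total = 5 * 4  # every chunk denoised with the full 4-step budget

print("steps per chunk:", schedule)
print("total steps (asymmetric):", sum(schedule))
print("total steps (uniform):", uniform_total)
```

Under these assumed values the asymmetric schedule spends fewer total denoising steps than a uniform one, which is the intuition behind giving early chunks the larger budget and letting later chunks reuse their appearance information.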
