Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model

Abstract

As a fundamental backbone for video generation, diffusion models are challenged by low inference speed due to the sequential nature of denoising. Previous methods speed up the models by caching and reusing model outputs at uniformly selected timesteps. However, such a strategy neglects the fact that differences among model outputs are not uniform across timesteps, which makes it hard to select the appropriate model outputs to cache and leads to a poor balance between inference efficiency and visual quality. In this study, we introduce Timestep Embedding Aware Cache (TeaCache), a training-free caching approach that estimates and leverages the fluctuating differences among model outputs across timesteps. Rather than directly using the time-consuming model outputs, TeaCache focuses on model inputs, which correlate strongly with the model outputs while incurring negligible computational cost. TeaCache first modulates the noisy inputs using the timestep embeddings so that their differences better approximate those of the model outputs. TeaCache then introduces a rescaling strategy to refine the estimated differences and uses them to indicate when outputs can be cached. Experiments show that TeaCache achieves up to 4.41x acceleration over Open-Sora-Plan with negligible (-0.07% Vbench score) degradation of visual quality.
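The caching decision described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the distance metric, the form of the rescaling, or the decision rule, so the relative L1 distance, the polynomial rescaling, and the accumulate-until-threshold rule below are all assumptions, and the names `CachedDenoiser`, `modulate`, `threshold`, and `coefficients` are hypothetical.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def relative_l1(current: Sequence[float], previous: Sequence[float]) -> float:
    """Relative L1 distance between two flattened inputs (assumed metric)."""
    numerator = sum(abs(c - p) for c, p in zip(current, previous))
    denominator = sum(abs(p) for p in previous) or 1e-8  # avoid division by zero
    return numerator / denominator


def rescale(diff: float, coefficients: Sequence[float]) -> float:
    """Assumed polynomial rescaling, highest-order coefficient first (Horner's rule)."""
    result = 0.0
    for c in coefficients:
        result = result * diff + c
    return result


class CachedDenoiser:
    """Hypothetical wrapper that reuses the last model output while the
    accumulated, rescaled input difference stays below a threshold."""

    def __init__(
        self,
        model: Callable[[Vector, float], Vector],
        modulate: Callable[[Vector, float], Vector],
        threshold: float,
        coefficients: Sequence[float] = (1.0, 0.0),  # identity rescaling by default
    ) -> None:
        self.model = model
        self.modulate = modulate  # stands in for timestep-embedding modulation
        self.threshold = threshold
        self.coefficients = coefficients
        self.accumulated = 0.0
        self.prev_modulated: Optional[Vector] = None
        self.cached_output: Optional[Vector] = None
        self.model_calls = 0

    def step(self, noisy_input: Vector, timestep: float) -> Vector:
        # Cheap proxy: difference of timestep-modulated inputs, not of outputs.
        modulated = self.modulate(noisy_input, timestep)
        reuse = False
        if self.prev_modulated is not None and self.cached_output is not None:
            diff = relative_l1(modulated, self.prev_modulated)
            self.accumulated += rescale(diff, self.coefficients)
            reuse = self.accumulated < self.threshold
        self.prev_modulated = modulated
        if reuse:
            return self.cached_output
        # Estimated change is too large (or no cache yet): run the full model.
        self.accumulated = 0.0
        self.cached_output = self.model(noisy_input, timestep)
        self.model_calls += 1
        return self.cached_output
```

In this sketch a larger `threshold` skips more model calls at the risk of reusing stale outputs, while `threshold=0.0` disables reuse entirely. The rescaling coefficients would presumably be fitted per model; the identity default here is only a placeholder.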
