RhymeFlow: Training-Free Acceleration of Video Generation via Asynchronous Denoising Flow Scheduling

Abstract

Video generation models based on Diffusion Transformers (DiTs) have achieved remarkable performance in video synthesis, yet they suffer from high inference latency and computational cost due to the quadratic complexity of 3D attention. Existing acceleration methods primarily reduce the computational complexity within each individual denoising step through techniques such as sparse attention and KV caching. However, they rigidly adhere to an inherent constraint of the standard diffusion pipeline: every frame in the target video sequence must undergo a complete, dense denoising process across all diffusion timesteps. We observe that, because adjacent frames share correlated content and motion, once the keyframes that carry critical semantic transitions are anchored, the intermediate states of the remaining frames often follow more predictable trajectories. This indicates that such a uniform, dense denoising process is inherently redundant for natural video data. To this end, we introduce RhymeFlow, a training-free framework that decouples the denoising trajectories of different frames. Specifically, we first identify a sparse set of pivotal keyframes that dominate the latent semantic evolution. Then, only these keyframes undergo dense, step-by-step denoising to ensure structural integrity, while non-keyframes progressively skip denoising steps to minimize computational cost. Because the skipped intermediate states of non-keyframes break temporal coherence during the keyframe denoising steps and lead to visual degradation, we further introduce a latent trajectory projection module, which enables keyframes to interact with a complete and temporally consistent sequence representation. Extensive experiments on current DiT-based video generation models demonstrate that our method outperforms existing baselines, achieving higher inference speed and better visual quality.
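
To make the idea of decoupled per-frame denoising concrete, the following is a minimal illustrative sketch, not the RhymeFlow algorithm itself. It assumes a hypothetical fixed `stride` for non-keyframes and a hand-picked keyframe set; the abstract does not specify how keyframes are selected or how the skipping schedule progresses, and the latent trajectory projection module is not modeled here. The names `build_step_mask` and `dense_fraction` are invented for this example.

```python
from typing import List, Set


def build_step_mask(num_frames: int, num_steps: int,
                    keyframes: Set[int], stride: int) -> List[List[bool]]:
    """Return mask[t][f], True when frame f gets a real denoising update at step t.

    Assumption (not from the paper): non-keyframes are updated every
    `stride`-th step and at the final step; keyframes are updated at every step.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    mask = []
    for t in range(num_steps):
        is_last_step = (t == num_steps - 1)
        row = []
        for f in range(num_frames):
            if f in keyframes:
                row.append(True)  # keyframes: dense, step-by-step denoising
            else:
                # non-keyframes: skip most steps under the assumed schedule
                row.append(t % stride == 0 or is_last_step)
        mask.append(row)
    return mask


def dense_fraction(mask: List[List[bool]]) -> float:
    """Fraction of (step, frame) updates that are actually computed."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


# Hypothetical setting: 8 frames, 10 steps, first and last frame as keyframes.
mask = build_step_mask(num_frames=8, num_steps=10, keyframes={0, 7}, stride=3)
print(f"computed updates: {dense_fraction(mask):.2%}")
```

The printed fraction only counts per-frame updates; it is not a latency estimate, since the actual cost of 3D attention depends on how many tokens participate at each step.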