Training-Free Efficient Video Generation via Dynamic Token Carving

Abstract

Despite the remarkable generation quality of video Diffusion Transformer (DiT) models, their practical deployment is severely hindered by extensive computational requirements. This inefficiency stems from two key challenges: the quadratic complexity of self-attention with respect to token length and the multi-step nature of diffusion models. To address these limitations, we present Jenga, a novel inference pipeline that combines dynamic attention carving with progressive resolution generation. Our approach leverages two key insights: (1) early denoising steps do not require high-resolution latents, and (2) later steps do not require dense attention. Jenga introduces a block-wise attention mechanism that dynamically selects relevant token interactions using 3D space-filling curves, alongside a progressive resolution strategy that gradually increases latent resolution during generation. Experimental results demonstrate that Jenga achieves substantial speedups across multiple state-of-the-art video diffusion models while maintaining comparable generation quality (8.83× speedup with a 0.01% performance drop on VBench). As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware by reducing inference time from minutes to seconds, without requiring model retraining. Code: https://github.com/dvlab-research/Jenga
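
The block-wise selection idea can be illustrated with a small sketch. This is not the Jenga implementation: the abstract does not say which space-filling curve is used or how block relevance is scored, so the sketch assumes a Z-order (Morton) curve and takes the block-to-block scores as given. All function names (`morton_key`, `curve_blocks`, `select_blocks`, `attention_fraction`) are hypothetical and chosen only for illustration.

```python
from itertools import product


def morton_key(t, y, x, bits=10):
    """Interleave the bits of (t, y, x) into one Z-order (Morton) index.

    Assumption: Z-order stands in for the paper's unspecified 3D curve.
    """
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (3 * b)
        key |= ((y >> b) & 1) << (3 * b + 1)
        key |= ((t >> b) & 1) << (3 * b + 2)
    return key


def curve_blocks(frames, height, width, block_size):
    """Order latent token coordinates along the curve, then cut into blocks."""
    coords = sorted(
        product(range(frames), range(height), range(width)),
        key=lambda c: morton_key(*c),
    )
    return [coords[i:i + block_size] for i in range(0, len(coords), block_size)]


def select_blocks(scores, keep):
    """For each query block, keep the indices of the `keep` best key blocks.

    `scores[i][j]` is an assumed relevance of key block j to query block i.
    """
    selected = []
    for row in scores:
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        selected.append(sorted(ranked[:keep]))
    return selected


def attention_fraction(num_blocks, keep):
    """Share of block pairs computed, relative to dense attention."""
    return (num_blocks * keep) / (num_blocks * num_blocks)


# Toy latent grid: 4 frames x 4 x 4 tokens, grouped into blocks of 8 tokens.
blocks = curve_blocks(4, 4, 4, 8)
# Toy scores for 2 query blocks against 3 key blocks.
chosen = select_blocks([[0.1, 0.9, 0.5], [0.7, 0.2, 0.3]], keep=2)
```

On this toy grid each block of 8 tokens is a contiguous 2×2×2 cube of the latent volume, which is why ordering tokens along a space-filling curve before blocking keeps spatially and temporally nearby tokens together. Keeping 2 of 8 key blocks per query block would compute a quarter of the block pairs that dense attention does.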
