Minute-Long Videos with Dual Parallelisms

Abstract

Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. To address this, we propose a novel distributed inference strategy, termed DualParal. The core idea is that, instead of generating an entire video on a single GPU, we parallelize both temporal frames and model layers across GPUs. However, a naive implementation of this division faces a key limitation: because diffusion models require synchronized noise levels across frames, the two forms of parallelism end up serialized. We address this with a block-wise denoising scheme: a sequence of frame blocks is processed through the pipeline with progressively decreasing noise levels. Each GPU handles a specific block and layer subset while passing previous results to the next GPU, enabling asynchronous computation and communication. To further optimize performance, we incorporate two key enhancements. First, a feature cache is implemented on each GPU to store and reuse features from the prior block as context, minimizing inter-GPU communication and redundant computation. Second, we employ a coordinated noise initialization strategy that shares initial noise patterns across GPUs, ensuring globally consistent temporal dynamics without extra resource costs. Together, these enable fast, artifact-free, and infinitely long video generation. Applied to the latest diffusion transformer video generator, our method efficiently produces 1,025-frame videos with up to 6.54× lower latency and 1.48× lower memory cost on 8× RTX 4090 GPUs.
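
The block-wise scheme described above can be illustrated with a toy scheduling sketch. This is not the DualParal implementation: the function names, the FIFO queue policy, and the unit-time layer stages are all assumptions made for illustration, and the feature cache and noise initialization are not modeled. The sketch only shows how blocks at staggered noise levels could keep several layer stages busy while each block still waits for its own previous denoising step.

```python
from collections import deque


def blockwise_denoise_order(num_blocks, num_steps):
    """Return (block_id, noise_level) work items in a block-wise order.

    Assumed policy: a FIFO queue holds blocks at staggered noise levels.
    Each round, one fresh fully noisy block enters, every queued block takes
    one denoising step, and blocks that reach level 0 leave the queue.
    """
    queue = deque()  # entries are [block_id, remaining_steps]
    next_block = 0
    items = []
    while next_block < num_blocks or queue:
        if next_block < num_blocks:
            queue.append([next_block, num_steps])
            next_block += 1
        for entry in queue:
            items.append((entry[0], entry[1]))  # step taken at this noise level
            entry[1] -= 1
        while queue and queue[0][1] == 0:
            queue.popleft()  # fully denoised block is emitted
    return items


def pipeline_start_ticks(items, num_devices):
    """Return the tick at which each work item enters the first layer stage.

    Assumptions: each device holds one layer subset and takes one tick per
    item, so an item entering at tick t occupies device s at tick t + s.
    A block's next step may not start before its previous step has left
    the last device.
    """
    last_finish = {}  # block_id -> tick when its previous step finished
    starts = []
    tick = 0
    for block_id, _level in items:
        tick = max(tick, last_finish.get(block_id, 0))
        starts.append(tick)
        last_finish[block_id] = tick + num_devices
        tick += 1  # the first stage is free again on the next tick
    return starts


if __name__ == "__main__":
    work = blockwise_denoise_order(num_blocks=6, num_steps=4)
    starts = pipeline_start_ticks(work, num_devices=2)
    makespan = starts[-1] + 2
    print("pipelined ticks:", makespan, "serial ticks:", len(work) * 2)
```

In this toy model, once the queue holds at least as many blocks as there are devices, consecutive work items belong to different blocks, so no stage has to idle waiting on a dependency. Stalls appear only while the queue is filling or draining.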
