Video-Infinity: Distributed Long Video Generation

Abstract

Diffusion models have recently achieved remarkable results in video generation. Despite this encouraging performance, generated videos are typically limited to a small number of frames, yielding clips that last only a few seconds. The main obstacles to producing longer videos are the substantial memory requirements and the long processing time on a single GPU. A straightforward solution is to split the workload across multiple GPUs, but this raises two issues: (1) ensuring that all GPUs communicate effectively to share timing and context information, and (2) adapting existing video diffusion models, which are usually trained on short sequences, to generate longer videos without additional training. To address these issues, we introduce Video-Infinity, a distributed inference pipeline that enables parallel processing across multiple GPUs for long-form video generation. Specifically, we propose two coherent mechanisms: Clip parallelism and Dual-scope attention. Clip parallelism optimizes the gathering and sharing of context information across GPUs, which minimizes communication overhead, while Dual-scope attention modulates the temporal self-attention to balance local and global contexts efficiently across devices. Together, the two mechanisms distribute the workload and enable fast generation of long videos. On a setup of 8 × Nvidia 6000 Ada GPUs (48 GB each), our method generates videos of up to 2,300 frames in approximately 5 minutes, which is 100 times faster than prior methods.
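The abstract names the two mechanisms but does not specify how they are implemented. As a rough illustration only, the Python sketch below assumes that Clip parallelism assigns each GPU a contiguous range of frames, and that Dual-scope attention lets each frame attend to a window of nearby frames plus a small set of frames spread evenly over the whole video. The function names (split_into_clips, dual_scope_keys), the parameters local_radius and num_global, and the even-spacing rule are all hypothetical choices made for this sketch, not details taken from the paper.

```python
from typing import List


def split_into_clips(num_frames: int, num_devices: int) -> List[range]:
    """Assign contiguous, near-equal frame ranges (clips) to each device.

    Assumption for this sketch: one clip per device, no overlap.
    """
    base, extra = divmod(num_frames, num_devices)
    clips: List[range] = []
    start = 0
    for rank in range(num_devices):
        # The first `extra` devices take one additional frame each.
        length = base + (1 if rank < extra else 0)
        clips.append(range(start, start + length))
        start += length
    return clips


def dual_scope_keys(
    frame: int, num_frames: int, local_radius: int, num_global: int
) -> List[int]:
    """Frame indices that `frame` attends to in this sketch.

    Local scope: frames within `local_radius` of `frame`.
    Global scope: `num_global` frames spaced evenly over the video
    (a hypothetical sampling rule, chosen only for illustration).
    """
    lo = max(0, frame - local_radius)
    hi = min(num_frames, frame + local_radius + 1)
    local = set(range(lo, hi))
    global_scope = {(i * num_frames) // num_global for i in range(num_global)}
    return sorted(local | global_scope)


if __name__ == "__main__":
    # Frame and GPU counts taken from the abstract's reported setup.
    clips = split_into_clips(num_frames=2300, num_devices=8)
    for rank, clip in enumerate(clips):
        print(f"GPU {rank}: frames {clip.start}..{clip.stop - 1} ({len(clip)} frames)")

    keys = dual_scope_keys(frame=1000, num_frames=2300, local_radius=8, num_global=16)
    print(f"frame 1000 attends to {len(keys)} frames")
```

Under these assumptions, each GPU handles roughly 288 of the 2,300 frames, and the number of keys per frame stays fixed as the video grows, which is the kind of property that would keep memory and communication per device bounded.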
