Video-Infinity: Distributed Long Video Generation
June 24, 2024
作者: Zhenxiong Tan, Xingyi Yang, Songhua Liu, Xinchao Wang
cs.AI
Abstract
Diffusion models have recently achieved remarkable results for video
generation. Despite the encouraging performances, the generated videos are
typically constrained to a small number of frames, resulting in clips lasting
merely a few seconds. The primary challenges in producing longer videos include
the substantial memory requirements and the extended processing time required
on a single GPU. A straightforward solution would be to split the workload
across multiple GPUs, which, however, leads to two issues: (1) ensuring all
GPUs communicate effectively to share timing and context information, and (2)
modifying existing video diffusion models, which are usually trained on short
sequences, to create longer videos without additional training. To tackle
these, in this paper we introduce Video-Infinity, a distributed inference
pipeline that enables parallel processing across multiple GPUs for long-form
video generation. Specifically, we propose two coherent mechanisms: Clip
parallelism and Dual-scope attention. Clip parallelism optimizes the gathering
and sharing of context information across GPUs which minimizes communication
overhead, while Dual-scope attention modulates the temporal self-attention to
balance local and global contexts efficiently across the devices. Together, the
two mechanisms join forces to distribute the workload and enable the fast
generation of long videos. Under an 8 x Nvidia 6000 Ada GPU (48G) setup, our
method generates videos up to 2,300 frames in approximately 5 minutes, enabling
long video generation at a speed 100 times faster than the prior methods.
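
The abstract describes Clip parallelism as letting each GPU denoise its own clip of frames while sharing context with the other devices at low communication cost. The sketch below is not the authors' implementation; it is a minimal illustration of one plausible reading, assuming that only boundary-frame features are exchanged via a single `torch.distributed.all_gather` per step, and that the helper name `exchange_boundary_context` and the `n_boundary` parameter are hypothetical.

```python
# Illustrative sketch of clip parallelism (assumptions, not the paper's code):
# each rank owns one clip of frames and gathers the boundary-frame features of
# all ranks so neighbouring clips stay temporally coherent.
# Launch with e.g. `torchrun --nproc_per_node=8 clip_parallel_sketch.py`.
import torch
import torch.distributed as dist


def exchange_boundary_context(clip_feats: torch.Tensor, n_boundary: int = 4) -> torch.Tensor:
    """clip_feats: (frames, tokens, dim) features for the clip owned by this rank.
    Returns the concatenated boundary features of every rank as shared context."""
    # Keep only the first/last few frames of the local clip as context to share.
    boundary = torch.cat([clip_feats[:n_boundary], clip_feats[-n_boundary:]], dim=0)
    gathered = [torch.empty_like(boundary) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, boundary.contiguous())  # one collective per step
    return torch.cat(gathered, dim=0)                 # (world * 2 * n_boundary, tokens, dim)


if __name__ == "__main__":
    dist.init_process_group(backend="gloo")           # "nccl" on multi-GPU setups
    rank = dist.get_rank()
    local_clip = torch.randn(32, 16, 320)             # toy clip: 32 frames, 16 tokens, dim 320
    shared = exchange_boundary_context(local_clip)
    print(f"rank {rank}: shared context shape {tuple(shared.shape)}")
    dist.destroy_process_group()
```

Exchanging only a handful of boundary frames, rather than entire clips, is one way the communication volume could stay small enough to keep the GPUs busy with denoising rather than data transfer.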
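Dual-scope attention, as summarized above, modulates temporal self-attention so that each frame balances a local neighbourhood against a sparse global context. The following single-device sketch is only an interpretation under stated assumptions: a `(frames, tokens, dim)` feature layout and the illustrative `local_window` / `global_stride` parameters are not from the paper, whose actual formulation spans multiple devices.

```python
# Minimal sketch of a dual-scope temporal attention (not the authors' code):
# each frame attends to (a) a local window of neighbouring frames and
# (b) a coarse, strided set of "global" frames, keeping context bounded.
import torch
import torch.nn.functional as F


def dual_scope_temporal_attention(x: torch.Tensor, local_window: int = 8,
                                  global_stride: int = 32) -> torch.Tensor:
    """x: (frames, tokens, dim) features held on one device."""
    f, _, d = x.shape
    global_ctx = x[::global_stride]                       # strided global scope
    out = torch.empty_like(x)
    for i in range(f):
        lo, hi = max(0, i - local_window), min(f, i + local_window + 1)
        ctx = torch.cat([x[lo:hi], global_ctx], dim=0)    # local + global frames
        q = x[i]                                          # (tokens, dim)
        k = ctx.reshape(-1, d)                            # flatten context frames
        attn = F.softmax(q @ k.T / d**0.5, dim=-1)
        out[i] = attn @ k                                 # keys double as values here
    return out


if __name__ == "__main__":
    frames = torch.randn(64, 16, 320)  # toy example: 64 frames, 16 tokens, dim 320
    print(dual_scope_temporal_attention(frames).shape)    # torch.Size([64, 16, 320])
```

The point of the two scopes is that long-range consistency can come from a few shared global frames while the expensive dense attention stays confined to a short local window, which is what makes the per-device cost roughly independent of total video length.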