

Make Pixels Dance: High-Dynamic Video Generation

November 18, 2023
作者: Yan Zeng, Guoqiang Wei, Jiani Zheng, Jiaxin Zou, Yang Wei, Yuchen Zhang, Hang Li
cs.AI

Abstract

Creating high-dynamic videos such as motion-rich actions and sophisticated visual effects poses a significant challenge in the field of artificial intelligence. Unfortunately, current state-of-the-art video generation methods, primarily focusing on text-to-video generation, tend to produce video clips with minimal motions despite maintaining high fidelity. We argue that relying solely on text instructions is insufficient and suboptimal for video generation. In this paper, we introduce PixelDance, a novel approach based on diffusion models that incorporates image instructions for both the first and last frames in conjunction with text instructions for video generation. Comprehensive experimental results demonstrate that PixelDance trained with public data exhibits significantly better proficiency in synthesizing videos with complex scenes and intricate motions, setting a new standard for video generation.
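The abstract describes conditioning a diffusion model on image instructions for the first and last frames alongside text instructions. As a rough illustration of what that conditioning interface could look like (this is a hedged sketch, not the authors' code: the tensor shapes, the channel-wise concatenation scheme, and the `build_model_input` helper are all assumptions for illustration):

```python
import numpy as np

def build_model_input(noisy_latents, first_frame, last_frame):
    """Combine per-frame image instructions with noisy video latents.

    noisy_latents: (T, C, H, W) noisy video latents at the current step
    first_frame:   (C, H, W) encoded first-frame image instruction
    last_frame:    (C, H, W) encoded last-frame image instruction

    Returns a (T, 3*C, H, W) array: the two image conditions are broadcast
    along the time axis and stacked on the channel axis, so the denoiser
    sees both instructions at every frame.
    """
    T = noisy_latents.shape[0]
    first = np.broadcast_to(first_frame, (T,) + first_frame.shape)
    last = np.broadcast_to(last_frame, (T,) + last_frame.shape)
    return np.concatenate([noisy_latents, first, last], axis=1)

# Toy shapes: 8 frames of 4-channel 32x32 latents plus two image conditions.
latents = np.random.randn(8, 4, 32, 32)
f0 = np.random.randn(4, 32, 32)
fT = np.random.randn(4, 32, 32)
x = build_model_input(latents, f0, fT)
print(x.shape)  # (8, 12, 32, 32)
```

The text instruction would enter separately (e.g. via cross-attention in the denoiser); only the image instructions are folded into the input channels in this sketch.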