Make Pixels Dance: High-Dynamic Video Generation

Abstract

Creating high-dynamic videos, such as motion-rich actions and sophisticated visual effects, poses a significant challenge in artificial intelligence. Unfortunately, current state-of-the-art video generation methods, which focus primarily on text-to-video generation, tend to produce video clips with minimal motion despite maintaining high fidelity. We argue that relying solely on text instructions is insufficient and suboptimal for video generation. In this paper, we introduce PixelDance, a novel approach based on diffusion models that incorporates image instructions for both the first and last frames in conjunction with text instructions for video generation. Comprehensive experimental results demonstrate that PixelDance trained with public data exhibits significantly better proficiency in synthesizing videos with complex scenes and intricate motions, setting a new standard for video generation.
