X-Dancer: Expressive Music to Human Dance Video Generation

Abstract

We present X-Dancer, a novel zero-shot, music-driven image animation pipeline that creates diverse, long-range, lifelike human dance videos from a single static image. At its core is a unified transformer-diffusion framework featuring an autoregressive transformer model that synthesizes extended, music-synchronized token sequences for 2D body, head, and hand poses, which then guide a diffusion model to produce coherent and realistic dance video frames. Unlike traditional methods that primarily generate human motion in 3D, X-Dancer addresses data limitations and enhances scalability by modeling a wide spectrum of 2D dance motions, capturing their nuanced alignment with musical beats through readily available monocular videos. To achieve this, we first build a spatially compositional token representation from 2D human pose labels associated with keypoint confidences, encoding both large articulated body movements (e.g., upper and lower body) and fine-grained motions (e.g., head and hands). We then design a music-to-motion transformer model that autoregressively generates music-aligned dance pose token sequences, incorporating global attention to both musical style and prior motion context. Finally, we leverage a diffusion backbone to animate the reference image with these synthesized pose tokens through AdaIN, forming a fully differentiable end-to-end framework. Experimental results demonstrate that X-Dancer is able to produce both diverse and characterized dance videos, substantially outperforming state-of-the-art methods in terms of diversity, expressiveness, and realism. Code and model will be available for research purposes.
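
The abstract names AdaIN as the mechanism through which the synthesized pose tokens condition the diffusion backbone, but it does not spell out the details. As a rough illustration only, the following NumPy sketch shows the generic adaptive instance normalization operation: each feature channel is normalized to zero mean and unit variance, then rescaled and shifted by conditioning-dependent parameters. The `pose_to_affine` helper, the tensor shapes, and the random weights are hypothetical stand-ins and are not taken from X-Dancer.

```python
import numpy as np


def adain(features, scale, shift, eps=1e-5):
    """Generic AdaIN: normalize each channel, then apply a per-channel affine.

    features: (C, H, W) feature map; scale, shift: (C,) conditioning parameters.
    """
    mean = features.mean(axis=(1, 2), keepdims=True)
    std = features.std(axis=(1, 2), keepdims=True)
    normalized = (features - mean) / (std + eps)
    return scale[:, None, None] * normalized + shift[:, None, None]


def pose_to_affine(pose_embedding, weight, bias):
    """Hypothetical linear head mapping a pose embedding (D,) to (scale, shift).

    This is an assumption for illustration, not the paper's actual design.
    """
    out = weight @ pose_embedding + bias  # shape (2C,)
    scale, shift = np.split(out, 2)
    return scale, shift


# Toy shapes and random values, chosen only to make the sketch runnable.
rng = np.random.default_rng(0)
C, H, W, D = 4, 8, 8, 16
features = rng.normal(size=(C, H, W))
pose_embedding = rng.normal(size=D)
weight = rng.normal(size=(2 * C, D)) * 0.1
bias = np.concatenate([np.ones(C), np.zeros(C)])  # start near identity

scale, shift = pose_to_affine(pose_embedding, weight, bias)
modulated = adain(features, scale, shift)
```

After modulation, each channel's mean matches `shift` and its standard deviation matches the magnitude of `scale`, which is how a conditioning signal can steer feature statistics without altering the spatial layout of the normalized features.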
