
X-Dancer: Expressive Music to Human Dance Video Generation

February 24, 2025
作者: Zeyuan Chen, Hongyi Xu, Guoxian Song, You Xie, Chenxu Zhang, Xin Chen, Chao Wang, Di Chang, Linjie Luo
cs.AI

Abstract

We present X-Dancer, a novel zero-shot music-driven image animation pipeline that creates diverse and long-range lifelike human dance videos from a single static image. At its core, we introduce a unified transformer-diffusion framework, featuring an autoregressive transformer model that synthesizes extended, music-synchronized token sequences for 2D body, head, and hand poses, which then guide a diffusion model to produce coherent and realistic dance video frames. Unlike traditional methods that primarily generate human motion in 3D, X-Dancer addresses data limitations and enhances scalability by modeling a wide spectrum of 2D dance motions, capturing their nuanced alignment with musical beats through readily available monocular videos. To achieve this, we first build a spatially compositional token representation from 2D human pose labels associated with keypoint confidences, encoding both large articulated body movements (e.g., upper and lower body) and fine-grained motions (e.g., head and hands). We then design a music-to-motion transformer model that autoregressively generates music-aligned dance pose token sequences, incorporating global attention to both musical style and prior motion context. Finally, we leverage a diffusion backbone to animate the reference image with these synthesized pose tokens through AdaIN, forming a fully differentiable end-to-end framework. Experimental results demonstrate that X-Dancer is able to produce both diverse and characterized dance videos, substantially outperforming state-of-the-art methods in terms of diversity, expressiveness, and realism. Code and models will be made available for research purposes.
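The final stage of the pipeline conditions a diffusion backbone on the synthesized pose tokens through AdaIN (adaptive instance normalization). As a rough illustration of the general AdaIN mechanism the abstract names (not the paper's implementation; the function name, array shapes, and NumPy setting are assumptions for this sketch), per-channel content statistics are replaced by statistics derived from the conditioning signal:

```python
import numpy as np

def adain(content, style_mean, style_std, eps=1e-5):
    """Adaptive Instance Normalization sketch.

    Normalizes each channel of `content` (shape (C, H, W)) to zero mean
    and unit variance, then rescales with per-channel statistics
    `style_mean` and `style_std` (each shape (C,)) derived from a
    conditioning signal -- in X-Dancer's setting, the pose-token
    embeddings. Shapes and names are illustrative assumptions.
    """
    mu = content.mean(axis=(1, 2), keepdims=True)       # per-channel mean
    sigma = content.std(axis=(1, 2), keepdims=True)     # per-channel std
    normalized = (content - mu) / (sigma + eps)
    return style_std[:, None, None] * normalized + style_mean[:, None, None]
```

In the paper's end-to-end framework the style statistics would be predicted from the generated pose tokens by learned layers inside the diffusion backbone; here they are passed in directly to keep the sketch self-contained.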

