X-Dancer: Expressive Music to Human Dance Video Generation
February 24, 2025
Authors: Zeyuan Chen, Hongyi Xu, Guoxian Song, You Xie, Chenxu Zhang, Xin Chen, Chao Wang, Di Chang, Linjie Luo
cs.AI
Abstract
We present X-Dancer, a novel zero-shot music-driven image animation pipeline
that creates diverse and long-range lifelike human dance videos from a single
static image. At its core, we introduce a unified transformer-diffusion
framework, featuring an autoregressive transformer model that synthesizes
extended and music-synchronized token sequences for 2D body, head and hands
poses, which then guide a diffusion model to produce coherent and realistic
dance video frames. Unlike traditional methods that primarily generate human
motion in 3D, X-Dancer addresses data limitations and enhances scalability by
modeling a wide spectrum of 2D dance motions, capturing their nuanced alignment
with musical beats through readily available monocular videos. To achieve this,
we first build a spatially compositional token representation from 2D human
pose labels associated with keypoint confidences, encoding both large
articulated body movements (e.g., upper and lower body) and fine-grained
motions (e.g., head and hands). We then design a music-to-motion transformer
model that autoregressively generates music-aligned dance pose token sequences,
incorporating global attention to both musical style and prior motion context.
Finally, we leverage a diffusion backbone to animate the reference image with
these synthesized pose tokens through AdaIN, forming a fully differentiable
end-to-end framework. Experimental results demonstrate that X-Dancer is able to
produce both diverse and characterized dance videos, substantially
outperforming state-of-the-art methods in terms of diversity, expressiveness and
realism. Code and model will be available for research purposes.
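
The abstract states that the synthesized pose tokens condition the diffusion backbone through AdaIN (adaptive instance normalization). As a rough illustration of that operation only, the sketch below normalizes each feature channel to zero mean and unit variance and then applies a per-channel scale and shift. It assumes the scale and shift would be predicted from the pose tokens; the function names, the flat-list feature layout, and the example values are hypothetical and are not taken from the X-Dancer code.

```python
import math


def channel_stats(channel, eps=1e-5):
    """Return (mean, std) of one feature channel given as a flat list of floats."""
    mean = sum(channel) / len(channel)
    var = sum((v - mean) ** 2 for v in channel) / len(channel)
    return mean, math.sqrt(var + eps)


def adain(content, scale, shift, eps=1e-5):
    """Adaptive instance normalization over a list of channels.

    content: list of channels, each a flat list of floats (hypothetical layout).
    scale, shift: one float per channel. Assumption: in a pose-conditioned
    model these would be predicted from the pose tokens.
    """
    out = []
    for channel, s, b in zip(content, scale, shift):
        mean, std = channel_stats(channel, eps)
        # Normalize the channel, then re-style it with the conditioning values.
        out.append([s * (v - mean) / std + b for v in channel])
    return out


# Two toy channels with hypothetical conditioning parameters.
features = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 10.0, 14.0]]
styled = adain(features, scale=[2.0, 0.5], shift=[1.0, -1.0])
```

After the call, each output channel has approximately the mean given by `shift` and the standard deviation given by `scale`, whatever the statistics of the input channel were. This is what lets an external signal modulate intermediate features without changing their spatial layout.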