MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation
May 7, 2026
Authors: Kaixing Yang, Jiashu Zhu, Xulong Tang, Ziqiao Peng, Xiangyue Zhang, Puwei Wang, Jiahong Wu, Xiangxiang Chu, Hongyan Liu, Jun He
cs.AI
Abstract
With the rise of online dance-video platforms and rapid advances in AI-generated content (AIGC), music-driven dance generation has emerged as a compelling research direction. Despite substantial progress in related domains such as music-driven 3D dance generation, pose-driven image animation, and audio-driven talking-head synthesis, existing methods cannot be directly adapted to this task. Moreover, the few studies in this area still struggle to jointly achieve high-quality visual appearance and realistic human motion. Accordingly, we present MACE-Dance, a music-driven dance video generation framework built on cascaded Mixture-of-Experts (MoE). The Motion Expert performs music-to-3D motion generation while enforcing kinematic plausibility and artistic expressiveness, whereas the Appearance Expert carries out motion- and reference-conditioned video synthesis, preserving visual identity with spatiotemporal coherence. Specifically, the Motion Expert adopts a diffusion model with a BiMamba-Transformer hybrid architecture and a Guidance-Free Training (GFT) strategy, achieving state-of-the-art (SOTA) performance in 3D dance generation. The Appearance Expert employs a decoupled kinematic-aesthetic fine-tuning strategy, achieving SOTA performance in pose-driven image animation. To better benchmark this task, we curate a large-scale and diverse dataset and design a joint motion-appearance evaluation protocol. Under this protocol, MACE-Dance likewise achieves the best overall performance. Code is available at https://github.com/AMAP-ML/MACE-Dance.
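The two-stage cascade described above can be sketched as a minimal pipeline: the Motion Expert maps music features to a 3D motion sequence, which then drives the Appearance Expert's reference-conditioned video synthesis. All class and method names below are illustrative assumptions, not the actual MACE-Dance API; the real diffusion-based implementations live in the linked repository.

```python
# Hypothetical sketch of the cascaded Motion -> Appearance expert pipeline.
# Names (MotionExpert, AppearanceExpert, mace_dance_pipeline) are assumptions
# for illustration only, not the repository's actual interfaces.
from dataclasses import dataclass
from typing import Any, List


@dataclass
class MotionSequence:
    """Stand-in for the 3D pose sequence produced by the Motion Expert."""
    frames: List[List[float]]  # one pose vector per time step


class MotionExpert:
    """Stage 1: music -> 3D motion (a diffusion model in the paper)."""

    def generate(self, music_features: List[float]) -> MotionSequence:
        # Placeholder: the real model denoises a motion sequence conditioned
        # on music via a BiMamba-Transformer backbone trained with GFT.
        return MotionSequence(frames=[[0.0, 0.0, 0.0] for _ in music_features])


class AppearanceExpert:
    """Stage 2: (motion, reference image) -> dance video frames."""

    def synthesize(self, motion: MotionSequence, reference: Any) -> List[Any]:
        # Placeholder: the real model renders identity-preserving frames
        # conditioned on the driving poses and the reference appearance.
        return [("frame", pose, reference) for pose in motion.frames]


def mace_dance_pipeline(music_features: List[float], reference: Any) -> List[Any]:
    """Cascade the experts: Motion Expert output feeds the Appearance Expert."""
    motion = MotionExpert().generate(music_features)
    return AppearanceExpert().synthesize(motion, reference)


video = mace_dance_pipeline(music_features=[0.1, 0.2, 0.3], reference="ref.png")
print(len(video))  # one synthesized frame per music time step -> 3
```

The design point the sketch captures is the decoupling: motion quality and visual appearance are optimized by separate experts, and the only interface between them is the pose sequence.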