MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation
May 7, 2026
作者: Kaixing Yang, Jiashu Zhu, Xulong Tang, Ziqiao Peng, Xiangyue Zhang, Puwei Wang, Jiahong Wu, Xiangxiang Chu, Hongyan Liu, Jun He
cs.AI
Abstract
With the rise of online dance-video platforms and rapid advances in AI-generated content (AIGC), music-driven dance video generation has emerged as a compelling research direction. Despite substantial progress in related domains such as music-driven 3D dance generation, pose-driven image animation, and audio-driven talking-head synthesis, existing methods cannot be directly adapted to this task. Moreover, the few existing studies of this task still struggle to jointly achieve high-quality visual appearance and realistic human motion. Accordingly, we present MACE-Dance, a music-driven dance video generation framework built on cascaded Mixture-of-Experts (MoE). The Motion Expert performs music-to-3D motion generation while enforcing kinematic plausibility and artistic expressiveness, whereas the Appearance Expert carries out motion- and reference-conditioned video synthesis, preserving visual identity while maintaining spatiotemporal coherence. Specifically, the Motion Expert adopts a diffusion model with a BiMamba-Transformer hybrid architecture and a Guidance-Free Training (GFT) strategy, achieving state-of-the-art (SOTA) performance in 3D dance generation. The Appearance Expert employs a decoupled kinematic-aesthetic fine-tuning strategy, likewise achieving SOTA performance in pose-driven image animation. To better benchmark this task, we curate a large-scale, diverse dataset and design a motion-appearance evaluation protocol; under this protocol, MACE-Dance also achieves SOTA performance. Code is available at https://github.com/AMAP-ML/MACE-Dance.
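The cascade described above can be sketched as two sequential stages: the Motion Expert maps music to a 3D motion sequence, and the Appearance Expert renders video frames conditioned on that motion plus a reference image. The sketch below is purely illustrative; all class names, method signatures, and the toy data flow are assumptions for exposition, not the paper's actual interfaces (those are in the linked repository).

```python
# Hypothetical sketch of the cascaded Motion/Appearance expert pipeline.
# Names and types are illustrative stand-ins, not MACE-Dance's real API.
from dataclasses import dataclass
from typing import List


@dataclass
class MotionSequence:
    """Stand-in for the 3D pose sequence produced by the Motion Expert."""
    frames: List[List[float]]  # per-frame joint parameters (toy representation)


class MotionExpert:
    """Music -> 3D motion. The paper uses a BiMamba-Transformer diffusion
    model trained with Guidance-Free Training (GFT); this is a placeholder."""

    def generate(self, music_features: List[float]) -> MotionSequence:
        # Toy behavior: emit one "pose" per music feature frame.
        return MotionSequence(frames=[[f] for f in music_features])


class AppearanceExpert:
    """(Motion, reference image) -> video frames. Placeholder for the
    pose-driven image-animation stage that preserves visual identity."""

    def render(self, motion: MotionSequence, reference_image: str) -> List[str]:
        # Toy behavior: tag each rendered frame with the reference identity,
        # one video frame per motion frame (spatiotemporal alignment).
        return [f"{reference_image}:frame{i}" for i in range(len(motion.frames))]


def mace_dance_cascade(music_features: List[float], reference_image: str) -> List[str]:
    """Cascade: the Appearance Expert is conditioned on the Motion Expert's output."""
    motion = MotionExpert().generate(music_features)
    return AppearanceExpert().render(motion, reference_image)


video = mace_dance_cascade([0.1, 0.5, 0.9], "dancer.png")
print(video)  # one rendered frame per music frame, carrying the reference identity
```

The key design point the sketch illustrates is the decoupling: motion quality (kinematics, musicality) is handled entirely before any pixels are synthesized, so each expert can be trained and evaluated independently.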