MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation

Abstract

With the rise of online dance-video platforms and rapid advances in AI-generated content (AIGC), music-driven dance generation has emerged as a compelling research direction. Despite substantial progress in related domains such as music-driven 3D dance generation, pose-driven image animation, and audio-driven talking-head synthesis, existing methods cannot be directly adapted to this task. Moreover, the few studies in this area still struggle to jointly achieve high-quality visual appearance and realistic human motion. Accordingly, we present MACE-Dance, a music-driven dance video generation framework built on cascaded Mixture-of-Experts (MoE). The Motion Expert performs music-to-3D motion generation while enforcing kinematic plausibility and artistic expressiveness, whereas the Appearance Expert carries out motion- and reference-conditioned video synthesis, preserving visual identity with spatiotemporal coherence. Specifically, the Motion Expert adopts a diffusion model with a BiMamba-Transformer hybrid architecture and a Guidance-Free Training (GFT) strategy, achieving state-of-the-art (SOTA) performance in 3D dance generation. The Appearance Expert employs a decoupled kinematic-aesthetic fine-tuning strategy, achieving SOTA performance in pose-driven image animation. To better benchmark this task, we curate a large-scale, diverse dataset and design a motion-appearance evaluation protocol. Under this protocol, MACE-Dance also achieves SOTA performance. Code is available at https://github.com/AMAP-ML/MACE-Dance.
