UniMuMo: Unified Text, Music and Motion Generation

Abstract

We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities. To address the lack of time-synchronized data, we align unpaired music and motion data based on rhythmic patterns to leverage existing large-scale music-only and motion-only datasets. By converting music, motion, and text into token-based representations, our model bridges these modalities through a unified encoder-decoder transformer architecture. To support multiple generation tasks within a single framework, we introduce several architectural improvements. We propose encoding motion with a music codebook, mapping motion into the same feature space as music. We introduce a music-motion parallel generation scheme that unifies all music and motion generation tasks into a single transformer decoder architecture with a single training task of music-motion joint generation. Moreover, the model is designed by fine-tuning existing pre-trained single-modality models, significantly reducing computational demands. Extensive experiments demonstrate that UniMuMo achieves competitive results on all unidirectional generation benchmarks across music, motion, and text modalities. Quantitative results are available on the project page: https://hanyangclarence.github.io/unimumo_demo/
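The abstract does not say how unpaired music and motion are aligned by rhythm. As a rough illustration of the general idea only, and not the authors' procedure, the sketch below pairs a music clip with the motion clip whose beat spacing is closest and computes a time-stretch factor for the motion timeline. The function names (`beat_interval`, `pair_by_rhythm`), the clip names, and the beat timestamps are all hypothetical, and beat times are assumed to have been extracted beforehand.

```python
from statistics import median


def beat_interval(beat_times):
    """Median gap in seconds between consecutive beats."""
    gaps = [b - a for a, b in zip(beat_times, beat_times[1:])]
    return median(gaps)


def pair_by_rhythm(music_beats, motion_clips):
    """Pick the motion clip whose beat interval is closest to the music's.

    Returns (clip_name, stretch). Multiplying the motion timestamps by
    `stretch` makes the motion's beat spacing match the music's.
    This is an illustrative assumption, not the method used in UniMuMo.
    """
    target = beat_interval(music_beats)
    name, beats = min(
        motion_clips.items(),
        key=lambda item: abs(beat_interval(item[1]) - target),
    )
    return name, target / beat_interval(beats)


# Hypothetical beat timestamps in seconds.
music_beats = [0.0, 0.5, 1.0, 1.5, 2.0]
motion_clips = {
    "walk": [0.0, 0.8, 1.6, 2.4],
    "dance": [0.1, 0.65, 1.2, 1.75],
}

name, stretch = pair_by_rhythm(music_beats, motion_clips)
print(name, round(stretch, 3))
```

Matching on a single median interval is the simplest possible choice; a beat-by-beat warping scheme would handle tempo drift better, at the cost of more code.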
