UniMuMo: Unified Text, Music and Motion Generation
October 6, 2024
Authors: Han Yang, Kun Su, Yutong Zhang, Jiaben Chen, Kaizhi Qian, Gaowen Liu, Chuang Gan
cs.AI
Abstract
We introduce UniMuMo, a unified multimodal model capable of taking arbitrary
text, music, and motion data as input conditions to generate outputs across all
three modalities. To address the lack of time-synchronized data, we align
unpaired music and motion data based on rhythmic patterns to leverage existing
large-scale music-only and motion-only datasets. By converting music, motion,
and text into token-based representations, our model bridges these modalities
through a unified encoder-decoder transformer architecture. To support multiple
generation tasks within a single framework, we introduce several architectural
improvements. We propose encoding motion with a music codebook, mapping motion
into the same feature space as music. We introduce a music-motion parallel
generation scheme that unifies all music and motion generation tasks into a
single transformer decoder architecture with a single training task of
music-motion joint generation. Moreover, the model is built by fine-tuning
existing pre-trained single-modality models, significantly reducing
computational demands. Extensive experiments demonstrate that UniMuMo achieves
competitive results on all unidirectional generation benchmarks across music,
motion, and text modalities. Quantitative results are available on the
project page: https://hanyangclarence.github.io/unimumo_demo/.
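To make two of the architectural ideas in the abstract more concrete, below is a minimal PyTorch sketch: motion features are quantized against a shared (music) codebook so both modalities share one discrete token space, and a single transformer decoder predicts music and motion tokens in parallel at every step, conditioned on text-encoder states. This is an illustration only, not the authors' implementation; the class names (SharedCodebook, ParallelDecoder), all dimensions, and the omission of causal masking are assumptions.

```python
# Illustrative sketch only (not the UniMuMo code). All names and sizes are hypothetical.
import torch
import torch.nn as nn


class SharedCodebook(nn.Module):
    """A frozen 'music' codebook reused to quantize motion features, so motion
    tokens live in the same discrete space as music tokens (assumed setup)."""

    def __init__(self, num_codes: int = 1024, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(num_codes, dim)

    @torch.no_grad()
    def quantize(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, time, dim) -> indices of nearest codebook entries, (batch, time)
        dists = ((features.unsqueeze(-2) - self.embed.weight) ** 2).sum(dim=-1)
        return dists.argmin(dim=-1)


class ParallelDecoder(nn.Module):
    """Toy stand-in for a music-motion parallel generation scheme: one
    transformer decoder with two output heads that produce music and motion
    logits at every time step, conditioned on text-encoder states."""

    def __init__(self, num_codes: int = 1024, dim: int = 128, layers: int = 2):
        super().__init__()
        self.token_embed = nn.Embedding(num_codes, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)
        self.music_head = nn.Linear(dim, num_codes)
        self.motion_head = nn.Linear(dim, num_codes)

    def forward(self, tokens: torch.Tensor, text_memory: torch.Tensor):
        # tokens: (batch, time) token ids; text_memory: (batch, src_len, dim).
        # Causal masking is omitted here for brevity.
        hidden = self.decoder(self.token_embed(tokens), text_memory)
        return self.music_head(hidden), self.motion_head(hidden)


if __name__ == "__main__":
    codebook = SharedCodebook()
    motion_feats = torch.randn(2, 50, 128)           # dummy motion-encoder output
    motion_tokens = codebook.quantize(motion_feats)  # motion mapped into the music token space
    decoder = ParallelDecoder()
    text_memory = torch.randn(2, 16, 128)            # dummy text-encoder states
    music_logits, motion_logits = decoder(motion_tokens, text_memory)
    print(music_logits.shape, motion_logits.shape)   # torch.Size([2, 50, 1024]) each
```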