

UniMuMo: Unified Text, Music and Motion Generation

October 6, 2024
Authors: Han Yang, Kun Su, Yutong Zhang, Jiaben Chen, Kaizhi Qian, Gaowen Liu, Chuang Gan
cs.AI

Abstract

We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities. To address the lack of time-synchronized data, we align unpaired music and motion data based on rhythmic patterns to leverage existing large-scale music-only and motion-only datasets. By converting music, motion, and text into token-based representations, our model bridges these modalities through a unified encoder-decoder transformer architecture. To support multiple generation tasks within a single framework, we introduce several architectural improvements. We propose encoding motion with a music codebook, mapping motion into the same feature space as music. We introduce a music-motion parallel generation scheme that unifies all music and motion generation tasks into a single transformer decoder architecture with a single training task of music-motion joint generation. Moreover, the model is designed by fine-tuning existing pre-trained single-modality models, significantly reducing computational demands. Extensive experiments demonstrate that UniMuMo achieves competitive results on all unidirectional generation benchmarks across music, motion, and text modalities. Quantitative results are available on the project page: https://hanyangclarence.github.io/unimumo_demo/
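To make the two architectural ideas in the abstract more concrete, here is a minimal sketch, not the authors' implementation: motion features are quantized against an (assumed frozen) music codebook so both modalities share one token space, and a single causally masked transformer processes the fused music and motion token streams with two parallel prediction heads. All class names (SharedCodebook, ParallelMusicMotionDecoder), shapes, and hyperparameters are hypothetical assumptions for illustration.

```python
# Hedged sketch of "motion encoded with a music codebook" plus
# "music-motion parallel generation" in a single decoder. Assumed
# shapes/hyperparameters; not UniMuMo's actual code.
import torch
import torch.nn as nn


class SharedCodebook(nn.Module):
    """Quantize continuous motion features with a (frozen) music codebook."""
    def __init__(self, num_codes=1024, dim=128):
        super().__init__()
        self.codes = nn.Embedding(num_codes, dim)  # stands in for the music codebook

    def forward(self, motion_feats):  # (B, T, dim)
        # Nearest-neighbour lookup: each motion frame -> a music code index.
        dists = ((motion_feats.unsqueeze(2) - self.codes.weight) ** 2).sum(dim=-1)
        return dists.argmin(dim=-1)  # (B, T) token ids in the music token space


class ParallelMusicMotionDecoder(nn.Module):
    """One causal transformer, two parallel token streams (music and motion)."""
    def __init__(self, num_codes=1024, dim=128, n_layers=4, n_heads=8):
        super().__init__()
        self.music_emb = nn.Embedding(num_codes, dim)
        self.motion_emb = nn.Embedding(num_codes, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)  # causal mask below
        self.music_head = nn.Linear(dim, num_codes)
        self.motion_head = nn.Linear(dim, num_codes)

    def forward(self, music_tokens, motion_tokens):  # both (B, T)
        # Fuse the two streams per time step, then decode causally.
        x = self.music_emb(music_tokens) + self.motion_emb(motion_tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.backbone(x, mask=mask)
        # Parallel heads predict the next music and motion tokens jointly.
        return self.music_head(h), self.motion_head(h)


# Toy usage with random data, only to show the expected shapes.
B, T, D = 2, 16, 128
codebook = SharedCodebook(dim=D)
motion_tokens = codebook(torch.randn(B, T, D))
music_tokens = torch.randint(0, 1024, (B, T))
model = ParallelMusicMotionDecoder(dim=D)
music_logits, motion_logits = model(music_tokens, motion_tokens)
print(music_logits.shape, motion_logits.shape)  # (B, T, 1024) each
```

Because both streams are predicted at every step by one decoder, a single training objective of joint music-motion generation can, in principle, cover music-to-motion, motion-to-music, and joint generation, which is the unification the abstract describes.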

