M^{2}UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models
November 19, 2023
Authors: Atin Sakkeer Hussain, Shansong Liu, Chenshuo Sun, Ying Shan
cs.AI
Abstract
The current landscape of research leveraging large language models (LLMs) is
experiencing a surge. Many works harness the powerful reasoning capabilities of
these models to comprehend various modalities, such as text, speech, images,
videos, etc. They also utilize LLMs to understand human intention and generate
desired outputs like images, videos, and music. However, research that combines
both understanding and generation using LLMs is still limited and in its
nascent stage. To address this gap, we introduce a Multi-modal Music
Understanding and Generation (M^{2}UGen) framework that integrates an LLM's
abilities to comprehend and generate music across different modalities. The
M^{2}UGen framework is purpose-built to unlock creative potential from
diverse sources of inspiration, encompassing music, image, and video through
the use of pretrained MERT, ViT, and ViViT models, respectively. To enable
music generation, we explore the use of AudioLDM 2 and MusicGen. Bridging
multi-modal understanding and music generation is accomplished through the
integration of the LLaMA 2 model. Furthermore, we make use of the MU-LLaMA
model to generate extensive datasets that support text/image/video-to-music
generation, facilitating the training of our M^{2}UGen framework. We conduct
a thorough evaluation of our proposed framework. The experimental results
demonstrate that our model achieves or surpasses the performance of the current
state-of-the-art models.
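
The abstract names the building blocks of the pipeline (MERT, ViT, and ViViT as frozen feature encoders, LLaMA 2 as the bridge, and AudioLDM 2 or MusicGen as the music decoder) but not how they are connected. The sketch below is a minimal, hypothetical illustration of such a wiring using Hugging Face Transformers; the linear adapters, mean pooling, and checkpoint names are assumptions made for illustration and are not taken from the M^{2}UGen paper.

```python
# Hypothetical sketch: frozen modality encoders -> linear adapters -> LLaMA 2.
# Adapter design, pooling, and checkpoints are illustrative assumptions, not
# the M^{2}UGen implementation.
import torch.nn as nn
from transformers import AutoModel, ViTModel, VivitModel, LlamaModel


class MultiModalBridge(nn.Module):
    """Toy bridge: encode one modality, project into the LLM's hidden space,
    and pool the LLM output into a single conditioning vector."""

    def __init__(self, llm_name="meta-llama/Llama-2-7b-hf"):
        super().__init__()
        # Frozen pretrained encoders, as named in the abstract.
        self.encoders = nn.ModuleDict({
            "music": AutoModel.from_pretrained("m-a-p/MERT-v1-330M",
                                               trust_remote_code=True),
            "image": ViTModel.from_pretrained("google/vit-base-patch16-224"),
            "video": VivitModel.from_pretrained(
                "google/vivit-b-16x2-kinetics400"),
        })
        self.llm = LlamaModel.from_pretrained(llm_name)  # gated; needs access
        for module in [*self.encoders.values(), self.llm]:
            module.requires_grad_(False)
        # Hypothetical trainable adapters: one linear projection per modality.
        llm_hidden = self.llm.config.hidden_size
        self.adapters = nn.ModuleDict({
            name: nn.Linear(enc.config.hidden_size, llm_hidden)
            for name, enc in self.encoders.items()
        })

    def forward(self, inputs, modality):
        # `inputs` holds the preprocessed tensors the chosen encoder expects
        # (pixel_values for ViT/ViViT, input_values for MERT).
        feats = self.encoders[modality](**inputs).last_hidden_state
        tokens = self.adapters[modality](feats)      # (B, T, llm_hidden)
        out = self.llm(inputs_embeds=tokens).last_hidden_state
        return out.mean(dim=1)                       # pooled conditioning vector
```

On the generation side, such a conditioning vector would still have to be mapped into the conditioning pathway of MusicGen or AudioLDM 2 through additional output adapters; since the abstract does not describe that interface, it is omitted from the sketch.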