M^{2}UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models
November 19, 2023
Authors: Atin Sakkeer Hussain, Shansong Liu, Chenshuo Sun, Ying Shan
cs.AI
Abstract
The current landscape of research leveraging large language models (LLMs) is
experiencing a surge. Many works harness the powerful reasoning capabilities of
these models to comprehend various modalities, such as text, speech, images,
videos, etc. They also utilize LLMs to understand human intention and generate
desired outputs like images, videos, and music. However, research that combines
both understanding and generation using LLMs is still limited and in its
nascent stage. To address this gap, we introduce a Multi-modal Music
Understanding and Generation (M^{2}UGen) framework that integrates an LLM's
abilities to comprehend and generate music across different modalities. The
M^{2}UGen framework is purpose-built to unlock creative potential from
diverse sources of inspiration, encompassing music, images, and video, using
the pretrained MERT, ViT, and ViViT models, respectively. To enable
music generation, we explore the use of AudioLDM 2 and MusicGen. Bridging
multi-modal understanding and music generation is accomplished through the
integration of the LLaMA 2 model. Furthermore, we make use of the MU-LLaMA
model to generate extensive datasets that support text/image/video-to-music
generation, facilitating the training of our M^{2}UGen framework. We conduct
a thorough evaluation of our proposed framework. The experimental results
demonstrate that our model achieves or surpasses the performance of the current
state-of-the-art models.
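
The abstract outlines the architecture at a high level: frozen modality encoders (MERT for music, ViT for images, ViViT for video) are bridged into a LLaMA 2 backbone, whose output in turn conditions a music decoder (MusicGen or AudioLDM 2). Below is a minimal, hypothetical PyTorch-style sketch of that wiring; the module names, the adapter design, and the dimensions (feat_dim, llm_dim, cond_dim) are illustrative assumptions for exposition, not the authors' implementation.

```python
# Hypothetical sketch: frozen encoder -> adapter -> LLaMA 2 backbone -> music decoder.
# All module names and dimensions below are assumptions, not the paper's code.
import torch
import torch.nn as nn


class ModalityAdapter(nn.Module):
    """Projects frozen encoder features into the LLM's embedding space."""

    def __init__(self, feat_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)


class M2UGenSketch(nn.Module):
    """Toy wiring of the pipeline described in the abstract."""

    def __init__(self, encoder: nn.Module, llm: nn.Module, decoder: nn.Module,
                 feat_dim: int = 1024, llm_dim: int = 4096, cond_dim: int = 768):
        super().__init__()
        self.encoder = encoder.eval()                 # frozen MERT / ViT / ViViT stand-in
        self.adapter = ModalityAdapter(feat_dim, llm_dim)
        self.llm = llm                                # LLaMA 2 backbone stand-in
        self.out_proj = nn.Linear(llm_dim, cond_dim)  # conditioning for MusicGen / AudioLDM 2
        self.decoder = decoder                        # music decoder stand-in

    def forward(self, modality_input: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                         # encoder stays frozen
            feats = self.encoder(modality_input)      # (B, T, feat_dim)
        prefix = self.adapter(feats)                  # (B, T, llm_dim)
        # Prepend modality tokens to the text embeddings and run the LLM.
        hidden = self.llm(torch.cat([prefix, text_embeds], dim=1))
        cond = self.out_proj(hidden)                  # conditioning signal
        return self.decoder(cond)                     # generated music/audio
```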