M^{2}UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models

Abstract

Research leveraging large language models (LLMs) is currently surging. Many works harness the powerful reasoning capabilities of these models to comprehend various modalities, such as text, speech, images, and videos. They also use LLMs to understand human intention and generate desired outputs such as images, videos, and music. However, research that combines both understanding and generation using LLMs is still limited and in its nascent stage. To address this gap, we introduce a Multi-modal Music Understanding and Generation (M^{2}UGen) framework that integrates the abilities of LLMs to comprehend and generate music across different modalities. The M^{2}UGen framework is purpose-built to unlock creative potential from diverse sources of inspiration, encompassing music, image, and video, through the use of pretrained MERT, ViT, and ViViT models, respectively. To enable music generation, we explore the use of AudioLDM 2 and MusicGen. The LLaMA 2 model bridges multi-modal understanding and music generation. Furthermore, we use the MU-LLaMA model to generate extensive datasets that support text/image/video-to-music generation, facilitating the training of our M^{2}UGen framework. We conduct a thorough evaluation of our proposed framework. The experimental results demonstrate that our model achieves or surpasses the performance of the current state-of-the-art models.
