MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation

Abstract

Existing text-to-music models can produce high-quality audio with great diversity. However, textual prompts alone cannot precisely control temporal musical features of the generated music, such as chords and rhythm. To address this challenge, we introduce MusiConGen, a temporally conditioned Transformer-based text-to-music model that builds upon the pretrained MusicGen framework. Our innovation lies in an efficient finetuning mechanism, tailored for consumer-grade GPUs, that integrates automatically extracted rhythm and chords as the condition signal. During inference, the condition can be either musical features extracted from a reference audio signal or a user-defined symbolic chord sequence, BPM, and textual prompt. Our performance evaluation on two datasets, one derived from extracted features and the other from user-created inputs, demonstrates that MusiConGen can generate realistic backing track music that aligns well with the specified conditions. We open-source the code and model checkpoints, and provide audio examples online at https://musicongen.github.io/musicongen_demo/.
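To make the user-defined condition concrete, the following is a minimal sketch of how a symbolic chord sequence and a BPM value could be expanded into frame-aligned signals. It is an illustration under stated assumptions, not the MusiConGen implementation: the function names, the `root:quality` chord notation, the fixed four beats per chord, the binary chroma encoding, and the 50 Hz frame rate are all hypothetical choices made for this example.

```python
# Illustrative sketch only; names and encodings are assumptions,
# not taken from the MusiConGen codebase.

NOTE_TO_PC = {
    "C": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3, "E": 4, "F": 5,
    "F#": 6, "Gb": 6, "G": 7, "G#": 8, "Ab": 8, "A": 9, "A#": 10,
    "Bb": 10, "B": 11,
}
# Semitone offsets from the root for two basic triad qualities.
QUALITY_INTERVALS = {"maj": (0, 4, 7), "min": (0, 3, 7)}


def chord_to_chroma(symbol):
    """Map a chord symbol such as 'A:min' to a 12-dim binary chroma vector."""
    root, quality = symbol.split(":")
    chroma = [0] * 12
    for interval in QUALITY_INTERVALS[quality]:
        chroma[(NOTE_TO_PC[root] + interval) % 12] = 1
    return chroma


def build_conditions(chords, bpm, beats_per_chord=4, frame_rate=50):
    """Expand chords and BPM into per-frame chroma and beat-onset signals.

    Returns (chroma_frames, beat_frames), where chroma_frames[i] is the
    chroma vector active at frame i and beat_frames[i] is 1 on beat onsets.
    """
    total_beats = len(chords) * beats_per_chord
    n_frames = round(total_beats * 60.0 / bpm * frame_rate)

    chroma_frames = []
    for f in range(n_frames):
        # Index of the beat that contains frame f.
        beat_idx = int((f * bpm) // (60 * frame_rate))
        chord_idx = min(beat_idx // beats_per_chord, len(chords) - 1)
        chroma_frames.append(chord_to_chroma(chords[chord_idx]))

    beat_frames = [0] * n_frames
    for b in range(total_beats):
        idx = round(b * 60.0 / bpm * frame_rate)
        if idx < n_frames:
            beat_frames[idx] = 1
    return chroma_frames, beat_frames


chroma_frames, beat_frames = build_conditions(["C:maj", "A:min"], bpm=120)
```

With two chords at 120 BPM and the assumed defaults, the sequence spans 4 seconds, so both returned lists have 200 frames. A reference-audio condition would instead obtain comparable signals from chord and beat extraction on the recording, which this sketch does not cover.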
