JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment

Abstract

Diffusion and flow-matching models have revolutionized automatic text-to-audio generation in recent years. These models are increasingly capable of generating high-quality audio that faithfully captures speech and acoustic events. However, there is still much room for improvement in creative audio generation, which primarily involves music and songs. Recent open lyrics-to-song models, such as DiffRhythm, ACE-Step, and LeVo, have set an acceptable standard in automatic song generation for recreational use. However, these models lack the fine-grained word-level controllability that musicians often want in their workflows. To the best of our knowledge, our flow-matching-based JAM is the first effort toward endowing song generation with word-level timing and duration control, allowing fine-grained vocal control. To improve the quality of generated songs and better align them with human preferences, we implement aesthetic alignment through Direct Preference Optimization, which iteratively refines the model using a synthetic dataset, eliminating the need for manual data annotations. Furthermore, we aim to standardize the evaluation of such lyrics-to-song models through our public evaluation dataset, JAME. We show that JAM outperforms existing models in terms of music-specific attributes.
