MusicMagus: Zero-Shot Text-to-Music Editing via Diffusion Models

Abstract

Recent advances in text-to-music generation models have opened new avenues in musical creativity. However, music generation usually involves iterative refinement, and how to edit the generated music remains a significant challenge. This paper introduces a novel approach to editing music generated by such models, enabling the modification of specific attributes, such as genre, mood, and instrument, while keeping other aspects unchanged. Our method transforms text editing into latent space manipulation and adds an extra constraint to enforce consistency. It integrates seamlessly with existing pretrained text-to-music diffusion models and requires no additional training. Experimental results demonstrate superior performance over both zero-shot and certain supervised baselines in style and timbre transfer evaluations. Additionally, we showcase the practical applicability of our approach in real-world music editing scenarios.
