MERIT: Learning Disentangled Music Representations for Audio Similarity

Abstract

Current music similarity models typically compute a single, monolithic score that entangles distinct musical dimensions such as melody, rhythm, and timbre. This limits user control and interpretability and makes nuanced queries impossible. We introduce MERIT, a framework for learning disentangled, factor-specific music representations, one for each of these three core dimensions. Real-world audio rarely contains variations in which only one musical factor changes. To address this, we introduce a novel training strategy that leverages conditional audio generation and source-separated stems to strongly encourage single-factor variation in the training data. Our evaluations demonstrate strong factor-wise disentanglement: each head responds strongly to its intended perceptual dimension while remaining near chance on the others, a representational property that holds across both the synthetic training domain and independent real-world audio.
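To make the idea of factor-specific queries concrete, the following is a minimal sketch of how per-factor embeddings could be compared. It is not the MERIT implementation: the abstract does not specify the similarity metric or the embedding format, so cosine similarity, the dictionary layout, the function names, and the toy two-dimensional vectors are all assumptions chosen for illustration.

```python
import math

# The three dimensions named in the abstract; one embedding head per factor.
FACTORS = ("melody", "rhythm", "timbre")


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_u * norm_v)


def factor_similarity(query, candidate):
    """Return one similarity score per factor instead of a single score."""
    return {f: cosine(query[f], candidate[f]) for f in FACTORS}


def rank_by_factor(query, catalog, factor):
    """Rank catalog items by similarity on a single chosen factor."""
    scored = [(name, cosine(query[factor], emb[factor]))
              for name, emb in catalog.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy embeddings; real head outputs would be much larger.
query = {"melody": [1.0, 0.0], "rhythm": [0.0, 1.0], "timbre": [1.0, 1.0]}
catalog = {
    "same_melody_cover": {"melody": [0.9, 0.1], "rhythm": [1.0, 0.0],
                          "timbre": [-1.0, 1.0]},
    "same_groove_remix": {"melody": [0.0, 1.0], "rhythm": [0.1, 0.9],
                          "timbre": [1.0, -1.0]},
}

best_melody = rank_by_factor(query, catalog, "melody")[0][0]
best_rhythm = rank_by_factor(query, catalog, "rhythm")[0][0]
```

In this toy catalog, the melody query and the rhythm query return different top matches, which is the kind of control a single entangled score cannot offer.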