MERIT: Learning Disentangled Music Representations for Audio Similarity

Abstract

Current music similarity models typically compute a single, monolithic score that entangles distinct musical dimensions such as melody, rhythm, and timbre. This limits user control and interpretability and rules out nuanced queries. We introduce MERIT, a framework for learning disentangled, factor-specific music representations for these three core dimensions. To overcome the lack of isolated musical variations in real-world audio, we adopt a novel training strategy that combines conditional audio generation with source-separated stems to strongly encourage single-factor variation in the training data. Our evaluations show strong factor-wise disentanglement: each head responds strongly to its intended perceptual dimension while remaining near chance on the others, and this property holds in both the synthetic training domain and independent real-world audio.
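As a rough illustration of what a factor-specific query could look like, the sketch below assumes each track has already been mapped to one embedding per factor (melody, rhythm, timbre) and scores a pair of tracks with a user-weighted mean of per-factor cosine similarities. The function names, the weighting scheme, and the toy vectors are hypothetical and are not taken from MERIT itself.

```python
import math

# Hypothetical factor names, mirroring the three dimensions in the abstract.
FACTORS = ("melody", "rhythm", "timbre")


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def factor_similarity(query, candidate, weights):
    """Weighted mean of per-factor cosine similarities.

    `query` and `candidate` map each factor name to an embedding vector;
    `weights` maps factor names to non-negative weights (missing = 0).
    This weighting scheme is an assumption made for illustration only.
    """
    total = sum(weights.get(f, 0.0) for f in FACTORS)
    if total <= 0:
        raise ValueError("at least one factor weight must be positive")
    score = sum(
        weights.get(f, 0.0) * cosine(query[f], candidate[f]) for f in FACTORS
    )
    return score / total


# Toy embeddings with made-up numbers: same melody, different rhythm and timbre.
query = {"melody": [1.0, 0.0], "rhythm": [0.0, 1.0], "timbre": [1.0, 1.0]}
cover = {"melody": [1.0, 0.0], "rhythm": [1.0, 0.0], "timbre": [-1.0, -1.0]}

melody_only = factor_similarity(query, cover, {"melody": 1.0})
timbre_only = factor_similarity(query, cover, {"timbre": 1.0})
print(melody_only, timbre_only)
```

With a single monolithic score, the two toy tracks would receive one number; here the melody-only query rates them as highly similar while the timbre-only query rates them as dissimilar.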