MERIT: Learning Disentangled Music Representations for Audio Similarity
May 26, 2026
Authors: Abhinaba Roy, Junyi Liang, Dorien Herremans
cs.AI
Abstract
Current music similarity models typically compute a single, monolithic score that entangles distinct musical dimensions such as melody, rhythm, and timbre. This limits user control and interpretability, making nuanced, factor-specific queries impossible. We introduce MERIT, a framework for learning disentangled, factor-specific music representations for these three core dimensions. Because real-world audio rarely offers variation isolated to a single musical factor, we adopt a novel training strategy that combines conditional audio generation with source-separated stems to strongly encourage single-factor variation in the training data. Our evaluations demonstrate strong factor-wise disentanglement: each head responds strongly to its intended perceptual dimension while remaining near chance on the others, a property that holds across both the synthetic training domain and independent real-world audio.
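To illustrate the kind of nuanced query that factor-specific representations could enable, the following is a minimal sketch and not the MERIT implementation. It assumes each track has already been mapped to one embedding per factor (melody, rhythm, timbre); the function names, the weighting scheme, and the toy vectors are all hypothetical and do not come from the paper.

```python
import math

# Hypothetical factor names, mirroring the three dimensions named in the abstract.
FACTORS = ("melody", "rhythm", "timbre")


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def factor_similarity(query, candidate, weights):
    """Weighted average of per-factor cosine similarities.

    `query` and `candidate` map each factor name to an embedding.
    `weights` maps factor names to non-negative importance values;
    factors missing from `weights` are ignored. This weighting is an
    illustrative assumption, not a scheme described in the source.
    """
    total = sum(weights.get(f, 0.0) for f in FACTORS)
    if total <= 0.0:
        raise ValueError("at least one factor needs a positive weight")
    score = sum(
        weights.get(f, 0.0) * cosine(query[f], candidate[f]) for f in FACTORS
    )
    return score / total


# Toy embeddings, invented for illustration only.
query = {"melody": [1.0, 0.0], "rhythm": [0.0, 1.0], "timbre": [1.0, 1.0]}
same_melody = {"melody": [2.0, 0.0], "rhythm": [1.0, 0.0], "timbre": [1.0, -1.0]}
same_rhythm = {"melody": [0.0, 1.0], "rhythm": [0.0, 3.0], "timbre": [1.0, -1.0]}

# A melody-only query ranks `same_melody` above `same_rhythm`.
melody_only = {"melody": 1.0}
melody_scores = {
    "same_melody": factor_similarity(query, same_melody, melody_only),
    "same_rhythm": factor_similarity(query, same_rhythm, melody_only),
}

# A rhythm-only query reverses that ranking.
rhythm_only = {"rhythm": 1.0}
rhythm_scores = {
    "same_melody": factor_similarity(query, same_melody, rhythm_only),
    "same_rhythm": factor_similarity(query, same_rhythm, rhythm_only),
}
```

With a single monolithic score, the two candidates above could not be told apart by which factor they share with the query. Keeping one similarity per factor lets the user choose the weights at query time.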