MERIT: Learning Disentangled Music Representations for Audio Similarity

Abstract

Current music similarity models typically compute a single, monolithic score that entangles distinct musical dimensions such as melody, rhythm, and timbre. This limits user control and interpretability and rules out fine-grained queries. We introduce MERIT, a framework for learning disentangled, factor-specific music representations for these three core dimensions. To overcome the lack of isolated musical variation in real-world audio, we adopt a novel training strategy that uses conditional audio generation and source-separated stems to strongly encourage single-factor variation in the training data. Our evaluations demonstrate strong factor-wise disentanglement: each head responds strongly to its intended perceptual dimension while remaining near chance on the others, a representational property that holds on both the synthetic training domain and independent real-world audio.