MERIT: Learning Disentangled Music Representations for Audio Similarity

Abstract

Current music similarity models typically compute a single, monolithic score, entangling distinct musical dimensions such as melody, rhythm, and timbre. This limits user control and interpretability and rules out fine-grained queries. We introduce MERIT, a framework for learning disentangled, factor-specific music representations tailored to these three core dimensions. Because real-world audio rarely offers musical variations in isolation, we adopt a novel training strategy that uses conditional audio generation and source-separated stems to strongly encourage single-factor variation in the training data. Our evaluations demonstrate strong factor-wise disentanglement: each head responds strongly to its intended perceptual dimension while remaining near chance on the others, a representational property that holds across both the synthetic training domain and independent real-world audio.