CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages

February 14, 2025
Authors: Shangda Wu, Zhancheng Guo, Ruibin Yuan, Junyan Jiang, Seungheon Doh, Gus Xia, Juhan Nam, Xiaobing Li, Feng Yu, Maosong Sun
cs.AI

Abstract

CLaMP 3 is a unified framework developed to address challenges of cross-modal and cross-lingual generalization in music information retrieval. Using contrastive learning, it aligns all major music modalities--including sheet music, performance signals, and audio recordings--with multilingual text in a shared representation space, enabling retrieval across unaligned modalities with text as a bridge. It features a multilingual text encoder adaptable to unseen languages, exhibiting strong cross-lingual generalization. Leveraging retrieval-augmented generation, we curated M4-RAG, a web-scale dataset consisting of 2.31 million music-text pairs. This dataset is enriched with detailed metadata that represents a wide array of global musical traditions. To advance future research, we release WikiMT-X, a benchmark comprising 1,000 triplets of sheet music, audio, and richly varied text descriptions. Experiments show that CLaMP 3 achieves state-of-the-art performance on multiple MIR tasks, significantly surpassing previous strong baselines and demonstrating excellent generalization in multimodal and multilingual music contexts.
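
The abstract's core mechanism, contrastive alignment of music and text in a shared representation space, can be illustrated with a small sketch. This is not the CLaMP 3 implementation: the function names, the symmetric InfoNCE-style loss, the temperature of 0.07, and the toy 2-dimensional embeddings are all assumptions chosen for illustration. The sketch shows why matched music-text pairs score a lower loss than mismatched ones, and how retrieval reduces to a nearest-neighbor search by cosine similarity once everything lives in one space.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        mx = max(row)  # subtract the max for numerical stability
        log_sum_exp = mx + math.log(sum(math.exp(x - mx) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(music_embs, text_embs, temperature=0.07):
    """InfoNCE-style loss averaged over music-to-text and text-to-music.

    Pair i of music_embs is assumed to match pair i of text_embs.
    The temperature value is an assumed, commonly used default.
    """
    logits = [[cosine(m, t) / temperature for t in text_embs] for m in music_embs]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for text-to-music
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


def retrieve(query_emb, candidate_embs):
    """Return the index of the candidate most similar to the query."""
    scores = [cosine(query_emb, c) for c in candidate_embs]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Toy embeddings (hypothetical): two music items and their two captions.
    music = [[1.0, 0.0], [0.0, 1.0]]
    text = [[0.9, 0.1], [0.1, 0.9]]
    print("aligned loss: ", symmetric_contrastive_loss(music, text))
    print("shuffled loss:", symmetric_contrastive_loss(music, text[::-1]))
    print("best match for query:", retrieve([1.0, 0.2], music))
```

Because every modality is trained against text in this way, two modalities that were never paired directly (for example sheet music and audio) still end up comparable in the same space, which is the sense in which text acts as a bridge.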
