MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training

May 31, 2023
作者: Yizhi Li, Ruibin Yuan, Ge Zhang, Yinghao Ma, Xingran Chen, Hanzhi Yin, Chenghua Lin, Anton Ragni, Emmanouil Benetos, Norbert Gyenge, Roger Dannenberg, Ruibo Liu, Wenhu Chen, Gus Xia, Yemin Shi, Wenhao Huang, Yike Guo, Jie Fu
cs.AI

Abstract

Self-supervised learning (SSL) has recently emerged as a promising paradigm for training generalisable models on large-scale data in the fields of vision, text, and speech. Although SSL has been proven effective in speech and audio, its application to music audio has yet to be thoroughly explored. This is primarily due to the distinctive challenges associated with modelling musical knowledge, particularly the tonal and pitched characteristics of music. To address this research gap, we propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels in the masked language modelling (MLM) style acoustic pre-training. In our exploration, we identified a superior combination of teacher models, which outperforms conventional speech and audio approaches in terms of performance. This combination includes an acoustic teacher based on Residual Vector Quantization - Variational AutoEncoder (RVQ-VAE) and a musical teacher based on the Constant-Q Transform (CQT). These teachers effectively guide our student model, a BERT-style transformer encoder, to better model music audio. In addition, we introduce an in-batch noise mixture augmentation to enhance the representation robustness. Furthermore, we explore a wide range of settings to overcome the instability in acoustic language model pre-training, which allows our designed paradigm to scale from 95M to 330M parameters. Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attains state-of-the-art (SOTA) overall scores. The code and models are online: https://github.com/yizhilll/MERT.
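To make the dual-teacher objective concrete, the sketch below shows one plausible form of the MLM-style pre-training loss: on masked frames, a BERT-style student predicts discrete RVQ-VAE codes from the acoustic teacher (classification) and log-CQT frames from the musical teacher (regression). This is a minimal illustration assuming precomputed teacher targets; the class name, head structure, dimensions, and loss weighting are our assumptions, not the authors' implementation (see the linked repository for the real one).

```python
# Minimal sketch of MERT-style masked pre-training with two teachers.
# Assumptions (not from the paper's code): precomputed teacher targets,
# equal loss weighting, 8 codebooks of size 1024, 84 CQT bins.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedTeacherObjective(nn.Module):
    """Predicts RVQ-VAE codes and CQT frames on masked positions."""

    def __init__(self, dim=768, num_codebooks=8, codebook_size=1024, cqt_bins=84):
        super().__init__()
        # One classification head per RVQ codebook (acoustic teacher).
        self.code_heads = nn.ModuleList(
            nn.Linear(dim, codebook_size) for _ in range(num_codebooks)
        )
        # Regression head onto CQT bins (musical teacher).
        self.cqt_head = nn.Linear(dim, cqt_bins)

    def forward(self, hidden, rvq_codes, cqt_target, mask):
        # hidden:     (B, T, dim)   student transformer outputs
        # rvq_codes:  (B, T, K)     integer codes from the RVQ-VAE teacher
        # cqt_target: (B, T, bins)  log-CQT frames (e.g. via librosa.cqt)
        # mask:       (B, T) bool,  True where the input frame was masked
        code_loss = sum(
            F.cross_entropy(head(hidden)[mask], rvq_codes[..., k][mask])
            for k, head in enumerate(self.code_heads)
        )
        cqt_loss = F.mse_loss(self.cqt_head(hidden)[mask], cqt_target[mask])
        return code_loss + cqt_loss
```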
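The in-batch noise mixture augmentation can likewise be sketched compactly. One common form, which we assume here for illustration (the exact mixing distribution and gain range are not specified in the abstract), mixes each clip in a batch with a randomly permuted clip from the same batch at a random gain before encoding:

```python
# Hypothetical sketch of in-batch noise mixture augmentation:
# each waveform is mixed with another example from the same batch.
import torch

def in_batch_noise_mix(wave: torch.Tensor, max_gain: float = 0.5) -> torch.Tensor:
    """wave: (B, T) raw audio batch. Returns the batch with in-batch mixing applied."""
    perm = torch.randperm(wave.size(0), device=wave.device)   # pick mixing partners
    gain = torch.rand(wave.size(0), 1, device=wave.device) * max_gain
    return wave + gain * wave[perm]
```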