Multitrack Music Transcription with a Time-Frequency Perceiver

Abstract

Multitrack music transcription aims to transcribe a music audio input into the musical notes of multiple instruments simultaneously. It is a very challenging task that typically requires a more complex model to achieve satisfactory results. In addition, prior works mostly focus on transcribing regular instruments while neglecting vocals, which are usually the most important signal source when present in a piece of music. In this paper, we propose a novel deep neural network architecture, Perceiver TF, to model the time-frequency representation of audio input for multitrack transcription. Perceiver TF augments the Perceiver architecture by introducing a hierarchical expansion with an additional Transformer layer to model temporal coherence. Accordingly, our model inherits the better scalability of the Perceiver, allowing it to handle the transcription of many instruments well in a single model. In experiments, we train a Perceiver TF to model 12 instrument classes as well as vocals in a multi-task learning manner. Our results demonstrate that the proposed system outperforms state-of-the-art counterparts (e.g., MT3 and SpecTNT) on various public datasets.
