Multitrack Music Transcription with a Time-Frequency Perceiver
June 19, 2023
Authors: Wei-Tsung Lu, Ju-Chiang Wang, Yun-Ning Hung
cs.AI
Abstract
Multitrack music transcription aims to transcribe a music audio input into
the musical notes of multiple instruments simultaneously. It is a very
challenging task that typically requires a more complex model to achieve
satisfactory results. In addition, prior works mostly focus on transcribing
regular instruments while neglecting vocals, which are usually the most
important signal source when present in a piece of music. In this paper, we
propose a novel deep neural network architecture, Perceiver TF, to model the
time-frequency representation of audio input for multitrack transcription.
Perceiver TF augments the Perceiver architecture by introducing a hierarchical
expansion with an additional Transformer layer to model temporal coherence.
Accordingly, our model inherits the benefits of Perceiver, namely better
scalability, allowing it to handle transcription of many instruments within a
single model. In experiments, we train a Perceiver TF to model 12 instrument
classes as well as vocals in a multi-task learning manner. Our results
demonstrate that the proposed system outperforms state-of-the-art
counterparts (e.g., MT3 and SpecTNT) on various public datasets.
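The two-stage attention pattern the abstract describes (Perceiver-style cross-attention over the frequency axis, followed by an added Transformer layer over the time axis) can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: the shapes (`T`, `F`, `L`, `D`), the single-head unprojected attention, and the use of one shared latent array are simplifying assumptions for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Simplified single-head scaled dot-product attention (no projections).
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

# Hypothetical sizes: T time frames, F frequency bins, L latents, D channels.
T, F, L, D = 8, 64, 4, 16
rng = np.random.default_rng(0)

spec = rng.normal(size=(T, F, D))   # time-frequency input features
latents = rng.normal(size=(L, D))   # learned latent array (shared across frames)

# Stage 1 -- spectral cross-attention (Perceiver-style): within each time
# frame, the latents attend to that frame's frequency bins.
lat = np.stack([attention(latents, spec[t], spec[t]) for t in range(T)])  # (T, L, D)

# Stage 2 -- temporal self-attention (the hierarchical extension): each
# latent track attends across time to model temporal coherence.
out = np.stack([attention(lat[:, l], lat[:, l], lat[:, l])
                for l in range(L)], axis=1)                               # (T, L, D)

print(out.shape)  # (8, 4, 16)
```

Because the latent array is much smaller than the frequency axis, the cross-attention cost scales with `L * F` rather than `F * F`, which is the scalability benefit the abstract attributes to the Perceiver design.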