Multitrack Music Transcription with a Time-Frequency Perceiver

Abstract

Multitrack music transcription aims to transcribe a music audio input into the musical notes of multiple instruments simultaneously. It is a very challenging task that typically requires a complex model to achieve satisfactory results. In addition, prior work mostly focuses on transcribing regular instruments while neglecting vocals, which are usually the most important signal source when present in a piece of music. In this paper, we propose a novel deep neural network architecture, Perceiver TF, to model the time-frequency representation of audio input for multitrack transcription. Perceiver TF augments the Perceiver architecture by introducing a hierarchical expansion with an additional Transformer layer to model temporal coherence. Accordingly, our model inherits the benefits of the Perceiver, which offers better scalability, allowing it to handle the transcription of many instruments well in a single model. In experiments, we train a Perceiver TF to model 12 instrument classes as well as vocals in a multi-task learning manner. Our results demonstrate that the proposed system outperforms state-of-the-art counterparts (e.g., MT3 and SpecTNT) on various public datasets.
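
To make the architectural idea concrete, the following is a minimal NumPy sketch of one possible reading of the abstract; it is not the authors' implementation. It assumes a Perceiver-style block in which a small set of latent vectors cross-attends to the frequency features of each time step, followed by self-attention among the latents, and then an extra attention layer in which each latent attends across time steps to model temporal coherence. The function names (`attention`, `perceiver_tf_block`), the tensor shapes, the layer ordering, and the omission of learned projections, layer normalization, and feed-forward sublayers are all simplifying assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Single-head scaled dot-product attention (no learned projections).

    q: (..., N, D), k and v: (..., M, D) -> output: (..., N, D)
    """
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def perceiver_tf_block(spec, latents):
    """Hypothetical block: spec is (T, F, D), latents is (T, K, D).

    T = time steps, F = frequency bins, K = latents per step, D = feature size.
    """
    # 1. Perceiver-style cross-attention: at every time step the K latents
    #    query the F frequency features (cost K*F per step, not F*F).
    latents = latents + attention(latents, spec, spec)
    # 2. Self-attention among the K latents within each time step.
    latents = latents + attention(latents, latents, latents)
    # 3. Assumed temporal layer: each latent index attends across all T steps.
    per_latent = np.swapaxes(latents, 0, 1)                # (K, T, D)
    per_latent = per_latent + attention(per_latent, per_latent, per_latent)
    return np.swapaxes(per_latent, 0, 1)                   # (T, K, D)


# Toy usage with random features standing in for a spectrogram embedding.
rng = np.random.default_rng(0)
T, F, K, D = 8, 64, 4, 16
spec = rng.standard_normal((T, F, D))
latents = np.tile(rng.standard_normal((K, D)), (T, 1, 1))  # same latents at every step
out = perceiver_tf_block(spec, latents)
print(out.shape)  # (8, 4, 16)
```

In this sketch the number of latents K is much smaller than the number of frequency bins F, so the attention cost per time step grows linearly with F. This is the general scalability property of the Perceiver that the abstract refers to; how the actual model assigns latents to instruments and produces note outputs is not specified in the abstract and is not modeled here.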
