Multitrack Music Transcription with a Time-Frequency Perceiver

Abstract

Multitrack music transcription aims to transcribe a music audio input into the musical notes of multiple instruments simultaneously. It is a very challenging task that typically requires a more complex model to achieve satisfactory results. In addition, prior works mostly focus on transcribing regular instruments while neglecting vocals, which are usually the most important signal source when present in a piece of music. In this paper, we propose a novel deep neural network architecture, Perceiver TF, to model the time-frequency representation of audio input for multitrack transcription. Perceiver TF augments the Perceiver architecture by introducing a hierarchical expansion with an additional Transformer layer to model temporal coherence. Accordingly, our model inherits the better scalability of Perceiver, allowing it to handle the transcription of many instruments well in a single model. In experiments, we train a Perceiver TF to model 12 instrument classes as well as vocals in a multi-task learning manner. Our results demonstrate that the proposed system outperforms state-of-the-art counterparts (e.g., MT3 and SpecTNT) on various public datasets.
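
As a rough illustration of the idea described in the abstract, the following minimal sketch shows a small latent array attending to the frequency tokens of each time step (Perceiver-style cross-attention), followed by attention along the time axis to model temporal coherence. This is not the authors' implementation. The function names `attend` and `perceiver_tf_sketch`, the tensor shapes, and the toy dimensions are all hypothetical, and the attention here is single-head with no learned projections, residual connections, or feed-forward layers, all of which a real Transformer layer would include.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Single-head scaled dot-product attention.

    Simplification (assumption): keys and values are the same vectors and
    there are no learned query/key/value projections.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[j] for w, kv in zip(weights, keys_values))
                    for j in range(d)])
    return out


def perceiver_tf_sketch(spec, latents):
    """spec: [T][F][D] time-frequency features; latents: [K][D] latent array.

    Returns a [T][K][D] array of latents that have been mixed over time.
    """
    # Step 1: at every time step, the K latents cross-attend to the F
    # frequency tokens, so the cost per step depends on K * F, not F * F.
    per_step = [attend(latents, frame) for frame in spec]  # [T][K][D]

    # Step 2: for each latent index, attend across all time steps
    # (the "additional Transformer layer" for temporal coherence,
    # reduced here to plain self-attention).
    num_steps, num_latents = len(per_step), len(latents)
    out = [[None] * num_latents for _ in range(num_steps)]
    for k in range(num_latents):
        sequence = [per_step[t][k] for t in range(num_steps)]
        mixed = attend(sequence, sequence)
        for t in range(num_steps):
            out[t][k] = mixed[t]
    return out


# Toy usage with hypothetical sizes: 4 time steps, 16 frequency tokens,
# feature width 8, and 3 latents.
random.seed(0)
T, F, D, K = 4, 16, 8, 3
spec = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(F)]
        for _ in range(T)]
latents = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]
out = perceiver_tf_sketch(spec, latents)
print(len(out), len(out[0]), len(out[0][0]))  # 4 3 8
```

The point of the sketch is the scaling argument: because the latent array is much smaller than the number of frequency tokens, the per-step attention cost grows with the product of the two sizes and not quadratically in the input size, which is the scalability benefit the abstract attributes to Perceiver.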
