PianoKontext: Expressive Performance Rendering from Deadpan Context

Abstract

Expressive performance rendering (EPR) aims to generate realistic performances conditioned on sequences of notes. However, flow matching audio editing models manipulate only synchronized music samples of the same duration, which limits their understanding of expressive timing. We introduce PianoKontext, a flow matching rendering model for classical piano music that generates variable-length performances in the latent space of a pretrained Music2Latent model. We synthesize MIDI scores into deadpan audio and apply Dynamic Time Warping (DTW) in the latent space to construct paired training data. The aligned embeddings are concatenated in DiT blocks, which gives the model a simple and effective way to learn the dependencies between the score and its performances. Audio samples are available on our demo page: https://realfolkcode.github.io/pianokontext_demo/.
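To illustrate the alignment step, the following is a minimal sketch of DTW between two latent sequences. It is not the authors' implementation: the Euclidean frame distance, the standard three-way step pattern, the choice to warp the score onto the performance timeline, and the function names `dtw_path` and `warp_to_performance` are all assumptions made for illustration. Plain NumPy arrays of shape (frames, dims) stand in for Music2Latent embeddings.

```python
import numpy as np


def dtw_path(score, perf):
    """Return the DTW warping path between two sequences of shape (T, D).

    Assumes Euclidean frame distance and unit steps (diagonal, down, right).
    """
    n, m = len(score), len(perf)
    # Pairwise distances between every score frame and every performance frame.
    cost = np.linalg.norm(score[:, None, :] - perf[None, :, :], axis=-1)

    # Accumulated cost with a padded border of infinities.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev

    # Backtrack from the last cell to the first, always taking the cheapest predecessor.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(candidates, key=lambda s: acc[s])
        path.append((i - 1, j - 1))
    return path[::-1]


def warp_to_performance(score, perf):
    """Resample the score latents so they have one frame per performance frame."""
    idx = np.zeros(len(perf), dtype=int)
    for i, j in dtw_path(score, perf):
        idx[j] = i  # if several score frames match one performance frame, keep the last
    return score[idx]


# Toy usage: random arrays stand in for deadpan-score and performance latents.
rng = np.random.default_rng(0)
score_latents = rng.normal(size=(40, 8))
perf_latents = rng.normal(size=(55, 8))
aligned_score = warp_to_performance(score_latents, perf_latents)
```

After warping, `aligned_score` has the same number of frames as `perf_latents`, so the two sequences can be paired frame by frame, which is the precondition for concatenating them as conditioning input.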