PianoKontext: Expressive Performance Rendering from Deadpan Context

Abstract

Expressive performance rendering (EPR) aims to generate realistic performances conditioned on sequences of notes. However, flow matching audio editing models operate only on synchronized music samples of the same duration, which limits their ability to model expressive timing. We introduce PianoKontext, a flow matching rendering model for classical piano music that generates variable-length performances in the latent space of a pretrained Music2Latent model. We synthesize MIDI scores into deadpan audio and apply Dynamic Time Warping (DTW) in the latent space to construct paired training data. The aligned embeddings are concatenated in DiT blocks, which lets the model learn the dependencies between score and performance in a simple and effective way. Audio samples are available on our demo page: https://realfolkcode.github.io/pianokontext_demo/.
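
To make the latent-space alignment step concrete, the following is a minimal sketch of DTW between two sequences of latent frames, written in plain NumPy. It is an illustration under stated assumptions, not the implementation used for PianoKontext: the function names `dtw_path` and `warp_score_to_performance` are hypothetical, the Euclidean frame distance and the standard three-way step pattern are assumed, and small arrays stand in for the embeddings that a Music2Latent encoder would produce.

```python
import numpy as np


def dtw_path(score_latents, perf_latents):
    """Return a DTW warping path between two latent sequences of shape (T, D).

    Assumption: Euclidean frame distance and the classic step pattern
    (diagonal, vertical, horizontal). The source does not specify either.
    """
    n, m = len(score_latents), len(perf_latents)
    # Pairwise distances between every score frame and every performance frame.
    diff = score_latents[:, None, :] - perf_latents[None, :, :]
    cost = np.linalg.norm(diff, axis=-1)

    # Accumulated cost with an extra border row/column set to infinity.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev

    # Backtrack from the end of both sequences to the start.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(candidates, key=lambda ij: acc[ij])
        path.append((i - 1, j - 1))
    return path[::-1]


def warp_score_to_performance(score_latents, perf_latents):
    """Resample the score latents so they have one frame per performance frame.

    Hypothetical helper: for each performance frame, keep the last score
    frame that the warping path matches to it.
    """
    index = np.zeros(len(perf_latents), dtype=int)
    for i, j in dtw_path(score_latents, perf_latents):
        index[j] = i
    return score_latents[index]
```

Calling `warp_score_to_performance(score, perf)` returns an array with the same number of frames as `perf`, so each score frame is paired with a performance frame. How the paired sequences are then concatenated inside the DiT blocks is not specified here, so that step is left out of the sketch.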