PianoKontext: Generating Expressive Performances from Deadpan Context

Abstract

Expressive performance rendering (EPR) aims to generate realistic performances conditioned on sequences of notes. However, existing flow matching audio editing models operate only on synchronized music samples of the same duration, which limits their understanding of expressive timing. We introduce PianoKontext, a flow matching rendering model for classical piano music that generates variable-length performances in the latent space of a pretrained Music2Latent model. We synthesize MIDI scores into deadpan audio and apply Dynamic Time Warping (DTW) in the latent space to construct paired training data. The aligned embeddings are concatenated in DiT blocks, which gives a simple and effective way to learn the dependencies between scores and performances. Audio samples are available on our demo page: https://realfolkcode.github.io/pianokontext_demo/.
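
To make the pairing step concrete, the following is a minimal sketch of how DTW could align a deadpan score latent sequence to a performance latent sequence of a different length. It is an illustration under stated assumptions, not the implementation used for PianoKontext: the function names `dtw_path` and `warp_score_to_performance`, the Euclidean frame cost, and the rule for resolving many-to-one matches are all hypothetical, and the abstract does not specify along which axis the aligned embeddings are concatenated.

```python
import numpy as np


def dtw_path(score, perf):
    """Return a DTW warping path between two latent sequences.

    score has shape (n, d) and perf has shape (m, d).
    The Euclidean frame cost is an assumption made for this sketch.
    """
    n, m = len(score), len(perf)
    # Pairwise distances between every score frame and performance frame.
    cost = np.linalg.norm(score[:, None, :] - perf[None, :, :], axis=-1)

    # Accumulated cost with a padded border of infinities.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )

    # Backtrack from the last cell to the first one.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(steps, key=lambda s: acc[s])
        path.append((i - 1, j - 1))
    return path[::-1]


def warp_score_to_performance(score, perf):
    """Resample score latents so they have one frame per performance frame."""
    index = np.zeros(len(perf), dtype=int)
    for i, j in dtw_path(score, perf):
        # Hypothetical tie rule: keep the last score frame matched to frame j.
        index[j] = i
    return score[index]


# Toy latents: the performance lingers on the first and last score frames.
score = np.array([[0.0], [1.0], [2.0]])
perf = np.array([[0.0], [0.0], [1.0], [2.0], [2.0]])
aligned = warp_score_to_performance(score, perf)
```

In this toy case `aligned` has the same number of frames as `perf`, so the two sequences can be paired frame by frame before being passed to a model.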