

JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment

July 28, 2025
作者: Renhang Liu, Chia-Yu Hung, Navonil Majumder, Taylor Gautreaux, Amir Ali Bagherzadeh, Chuan Li, Dorien Herremans, Soujanya Poria
cs.AI

Abstract

Diffusion and flow-matching models have revolutionized automatic text-to-audio generation in recent times. These models are increasingly capable of generating high-quality, faithful audio outputs that capture speech and acoustic events. However, there is still much room for improvement in creative audio generation, which primarily involves music and songs. Recent open lyrics-to-song models, such as DiffRhythm, ACE-Step, and LeVo, have set an acceptable standard in automatic song generation for recreational use. However, these models lack the fine-grained word-level controllability often desired by musicians in their workflows. To the best of our knowledge, our flow-matching-based JAM is the first effort toward endowing song generation with word-level timing and duration control, allowing fine-grained vocal control. To enhance the quality of generated songs and better align them with human preferences, we implement aesthetic alignment through Direct Preference Optimization, which iteratively refines the model using a synthetic dataset, eliminating the need for manual data annotations. Furthermore, we aim to standardize the evaluation of such lyrics-to-song models through our public evaluation dataset JAME. We show that JAM outperforms existing models in terms of music-specific attributes.
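
For context, the aesthetic alignment described above builds on the standard Direct Preference Optimization objective (Rafailov et al., 2023), sketched below; how JAM constructs preference pairs from its synthetic dataset is specific to the paper and may differ from this reference form.

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

Here $x$ is the conditioning input (e.g., lyrics and control signals), $y_w$ and $y_l$ are the preferred and dispreferred generations in a preference pair, $\pi_{\mathrm{ref}}$ is a frozen reference model, $\sigma$ is the logistic function, and $\beta$ scales the preference margin.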