JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment
July 28, 2025
作者: Renhang Liu, Chia-Yu Hung, Navonil Majumder, Taylor Gautreaux, Amir Ali Bagherzadeh, Chuan Li, Dorien Herremans, Soujanya Poria
cs.AI
Abstract
Diffusion and flow-matching models have revolutionized automatic
text-to-audio generation in recent times. These models are increasingly capable
of generating high-quality and faithful audio outputs capturing speech and
acoustic events. However, there is still much room for improvement in creative
audio generation that primarily involves music and songs. Recent open
lyrics-to-song models, such as DiffRhythm, ACE-Step, and LeVo, have set an
acceptable standard in automatic song generation for recreational use. However,
these models lack fine-grained word-level controllability often desired by
musicians in their workflows. To the best of our knowledge, our
flow-matching-based JAM is the first effort toward endowing word-level timing
and duration control in song generation, allowing fine-grained vocal control.
To enhance the quality of generated songs to better align with human
preferences, we implement aesthetic alignment through Direct Preference
Optimization, which iteratively refines the model using a synthetic dataset,
eliminating the need for manual data annotations. Furthermore, we aim to
standardize the evaluation of such lyrics-to-song models through our public
evaluation dataset JAME. We show that JAM outperforms existing models in
terms of music-specific attributes.
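The aesthetic alignment step relies on Direct Preference Optimization. For reference, the standard DPO objective (Rafailov et al., 2023) trains a policy directly on preference pairs without a separate reward model; the exact formulation used in JAM may differ in its conditioning and data construction:

```latex
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}})
= -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}}
\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
\right)
\right]
```

Here $x$ is the prompt (e.g., lyrics and style conditioning), $y_w$ and $y_l$ are the preferred and dispreferred generations, $\pi_{\text{ref}}$ is the frozen reference model, and $\beta$ controls the strength of the KL regularization toward the reference. In the iterative setup described in the abstract, the preference pairs come from a synthetic dataset rather than manual annotation.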