JAM: A Tiny Flow-Based Song Generator with Fine-Grained Controllability and Aesthetic Alignment

Abstract

Diffusion and flow-matching models have recently revolutionized automatic text-to-audio generation. These models are increasingly capable of generating high-quality audio that faithfully captures speech and acoustic events. However, there is still much room for improvement in creative audio generation, which primarily involves music and songs. Recent open lyrics-to-song models, such as DiffRhythm, ACE-Step, and LeVo, have set an acceptable standard in automatic song generation for recreational use. However, these models lack the fine-grained word-level controllability that musicians often want in their workflows. To the best of our knowledge, our flow-matching-based JAM is the first effort toward word-level timing and duration control in song generation, allowing fine-grained vocal control. To improve the quality of generated songs and better align them with human preferences, we implement aesthetic alignment through Direct Preference Optimization, which iteratively refines the model using a synthetic dataset and eliminates the need for manual data annotation. Furthermore, we aim to standardize the evaluation of such lyrics-to-song models through our public evaluation dataset, JAME. We show that JAM outperforms existing models in terms of music-specific attributes.