Improving Text-to-Music Generation with Human-Preference Rewards

Abstract

This paper describes our entry to the efficiency track of the Academic Text-to-Music (ATTM) Grand Challenge at ICME 2026. Beyond the FAD-CLAP and CLAP scores required by the challenge protocol, we add a learned human-preference reward from TuneJury, a twin pairwise ranker trained on open music-preference datasets. The reward serves both as a training-time conditioning signal and as a sample-selection criterion. The pipeline combines five engineering decisions on a 120M-parameter FluxAudio-S backbone, four at training time and one at inference: (i) training-time reward conditioning that doubles as an inference-time classifier-free guidance (CFG) axis, (ii) a sweep over five score-conditioning architectures, with different variants used for training and inference, (iii) expert iteration on the top decile of the data, (iv) a short preference-tuning pass (CRPO) for audio-text alignment, and (v) inference post-processing via joint CFG, source separation, and loudness normalization. A per-stage decomposition on 100 Song Describer prompts shows that training-time reward conditioning acts as a functional conditioning axis, that expert iteration is the dominant contributor, that the preference-tuning pass adds only noise-level gains, and that the inference-time score scalar is already saturated by the end of the chain.
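
As an illustration of the selection step behind expert iteration on the top decile, the following is a minimal sketch, not the authors' implementation. It assumes the decile is ranked by the learned preference reward, which the abstract does not state explicitly. The names `select_top_percent` and `reward_fn` are hypothetical, and `reward_fn` stands in for a scorer such as TuneJury whose actual interface is not described here.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def select_top_percent(
    samples: Sequence[T],
    reward_fn: Callable[[T], float],
    keep_percent: int = 10,
) -> List[Tuple[T, float]]:
    """Return the highest-reward samples, paired with their rewards.

    keep_percent=10 corresponds to the top decile. The reward function
    is a placeholder (assumption) for a learned preference scorer.
    """
    if not samples:
        return []
    # Score every candidate once, then rank from highest to lowest reward.
    scored = [(sample, reward_fn(sample)) for sample in samples]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Integer arithmetic avoids floating-point rounding in the cut-off;
    # at least one sample is always kept.
    keep = max(1, (len(scored) * keep_percent) // 100)
    return scored[:keep]
```

In an expert-iteration loop, the retained samples would form the fine-tuning set for the next round; the number of rounds and the pool size per round are not given in the abstract and are left open here.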