Improving Text-to-Music Generation with Human Preference Rewards

Abstract

We describe our entry to the efficiency track of the Academic Text-to-Music (ATTM) Grand Challenge at ICME 2026. In addition to the FAD-CLAP and CLAP scores of the challenge protocol, we use a learned human-preference reward from TuneJury, a twin pairwise ranker trained on open music-preference datasets. The reward serves both as a training-time conditioning signal and as a sample-selection criterion. The pipeline combines five engineering decisions on a 120M-parameter FluxAudio-S backbone, four at training time and one at inference: (i) training-time reward conditioning that doubles as an inference-time CFG axis, (ii) a sweep over five score-conditioning architectures, in which training and inference use different variants, (iii) expert iteration on the top decile, (iv) a short preference-tuning pass (CRPO) for audio-text alignment, and (v) inference post-processing via joint CFG, source separation, and loudness normalization. A per-stage decomposition on 100 Song Describer prompts shows that training-time reward conditioning acts as a functional conditioning axis, that expert iteration is the dominant contributor, that the preference-tuning pass adds only a noise-level gain, and that the inference-time score scalar is already saturated by the end of the chain.
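
As an illustration of decision (iii), the selection step of expert iteration on the top decile might look like the minimal sketch below. It is written under stated assumptions, not as the pipeline's actual implementation: the function name `top_decile`, the clip identifiers, and the use of a single scalar reward per candidate are all hypothetical, and the abstract does not specify how candidates are generated or how ties are broken.

```python
import math
from typing import List, Sequence, Tuple


def top_decile(
    samples: Sequence[str], rewards: Sequence[float]
) -> List[Tuple[str, float]]:
    """Return the highest-reward 10% of candidates (at least one), best first.

    Hypothetical helper: `samples` are identifiers of generated clips and
    `rewards` are the scalar preference scores assigned to them.
    """
    if len(samples) != len(rewards):
        raise ValueError("samples and rewards must have the same length")
    if not samples:
        return []
    # Round up so that small pools still keep at least one candidate.
    keep = max(1, math.ceil(len(samples) / 10))
    # Sort by reward, highest first; Python's sort is stable, so ties keep
    # their original relative order.
    ranked = sorted(zip(samples, rewards), key=lambda pair: pair[1], reverse=True)
    return ranked[:keep]


# Example: 20 candidate clips with increasing rewards; two survive.
clips = [f"clip_{i}" for i in range(20)]
scores = [i / 20 for i in range(20)]
selected = top_decile(clips, scores)
```

In an expert-iteration loop, the surviving clips would then serve as fine-tuning targets for the next round; that outer loop is omitted here because the abstract does not describe it.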