TuneJury: Een open metriek voor het verbeteren van preferentie-uitlijning bij muziekgeneratie

Samenvatting

We introduceren TuneJury, een open, instance-niveau paarsgewijs beloningsmodel voor tekst-naar-muziek dat een muziekvoorkeursscore voorspelt op basis van een tekstprompt en een audioclip. Het uitgebrachte checkpoint is getraind op openbaar beschikbare menselijke voorkeurslabels die arena-stijl (A vs. B) stemmen, metriek-afstemmingsvoorkeursparen, crowdsourced paarsgewijze vergelijkingen en expertesthetische beoordelingen omvatten. De voorspelde scoremarge tussen twee clips is goed gekalibreerd op onze vaste testsplitsing, wat datafiltering via een eenvoudige scoredrempel ondersteunt. TuneJury generaliseert naar zowel vaste testparen als out-of-distribution benchmarks en blijft concurrerend met eerdere baselines op de laatste. Voor generatoren die na training zijn uitgebracht, introduceren we ankerkalibratie, een post-hoc, per-systeem Bradley-Terry kalibratie die overeenstemming herstelt met aanzienlijk betere data-efficiëntie dan hertraining vanaf nul. Dezelfde bevroren beloning leidt tot consistente beloningsaswinsten in drie downstream-toepassingen: inferentie-tijd beste-van-N selectie, DITTO-stijl latente optimalisatie en expert-iteratie na-training. TuneJury is beschikbaar op https://github.com/yonghyunk1m/TuneJury.

English

We introduce TuneJury, an open, instance-level pairwise reward model for text-to-music that predicts a music preference score from a text prompt and an audio clip. The released checkpoint is trained on publicly available human-preference labels covering arena-style (A vs. B) votes, metric-alignment preference pairs, crowdsourced pairwise comparisons, and expert aesthetic ratings. The predicted score margin between two clips is well calibrated on our held-out test split, supporting data filtering via a simple score threshold. TuneJury generalizes to both held-out test pairs and out-of-distribution benchmarks, remaining competitive with prior baselines on the latter. For generators released after training, we introduce anchor calibration, a post-hoc, per-system Bradley-Terry calibration that recovers agreement at substantially better data efficiency than from-scratch retraining. The same frozen reward drives consistent reward-axis gains across three downstream applications: inference-time best-of-N selection, DITTO-style latent optimization, and expert-iteration post-training. TuneJury is available at https://github.com/yonghyunk1m/TuneJury.