TuneJury: An Open Metric for Improving Preference Alignment in Music Generation

Abstract

We introduce TuneJury, an open, instance-level pairwise reward model for text-to-music that predicts a music preference score from a text prompt and an audio clip. The released checkpoint is trained on publicly available human-preference labels covering arena-style (A vs. B) votes, metric-alignment preference pairs, crowdsourced pairwise comparisons, and expert aesthetic ratings. The predicted score margin between two clips is well calibrated on our held-out test split, supporting data filtering via a simple score threshold. TuneJury generalizes to both held-out test pairs and out-of-distribution benchmarks, remaining competitive with prior baselines on the latter. For generators released after training, we introduce anchor calibration, a post-hoc, per-system Bradley-Terry calibration that recovers agreement with substantially better data efficiency than retraining from scratch. The same frozen reward drives consistent reward-axis gains across three downstream applications: inference-time best-of-N selection, DITTO-style latent optimization, and expert-iteration post-training. TuneJury is available at https://github.com/yonghyunk1m/TuneJury.
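
As a rough illustration of how a scalar preference score could be used in the ways the abstract describes, the sketch below maps a score margin to a pairwise preference probability under a standard Bradley-Terry (logistic) model, filters items by a score threshold, and performs best-of-N selection. It is not the TuneJury API: the function names, the `score_fn` callable, and the example scores are all hypothetical, and the actual scoring interface should be taken from the repository.

```python
import math
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that clip A is preferred over clip B.

    Assumes the reward scores act as Bradley-Terry log-strengths, so the
    probability is the logistic function of the score margin.
    """
    margin = score_a - score_b
    # Numerically stable logistic: avoid overflow for large |margin|.
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)


def filter_by_threshold(
    items: Sequence[T], score_fn: Callable[[T], float], threshold: float
) -> List[T]:
    """Keep only the items whose score is at least `threshold`."""
    return [item for item in items if score_fn(item) >= threshold]


def best_of_n(candidates: Sequence[T], score_fn: Callable[[T], float]) -> T:
    """Return the candidate with the highest score (best-of-N selection)."""
    if not candidates:
        raise ValueError("best_of_n needs at least one candidate")
    return max(candidates, key=score_fn)


# Hypothetical scores standing in for a real reward model's output.
example_scores = {"clip_a.wav": 1.2, "clip_b.wav": -0.3, "clip_c.wav": 0.4}
score = example_scores.__getitem__

chosen = best_of_n(list(example_scores), score)
kept = filter_by_threshold(list(example_scores), score, threshold=0.0)
p_a_over_b = preference_probability(score("clip_a.wav"), score("clip_b.wav"))
```

In practice `score_fn` would wrap a forward pass of the frozen reward model on a (prompt, audio) pair, and the threshold would be chosen on held-out data rather than fixed by hand.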