TuneJury: An Open Metric for Improving Preference Alignment in Music Generation

Abstract

We introduce TuneJury, an open, instance-level pairwise reward model for text-to-music that predicts a music preference score from a text prompt and an audio clip. The released checkpoint is trained on publicly available human-preference labels covering arena-style (A vs. B) votes, metric-alignment preference pairs, crowdsourced pairwise comparisons, and expert aesthetic ratings. The predicted score margin between two clips is well calibrated on our held-out test split, supporting data filtering via a simple score threshold. TuneJury generalizes to both held-out test pairs and out-of-distribution benchmarks, remaining competitive with prior baselines on the latter. For generators released after training, we introduce anchor calibration, a post-hoc, per-system Bradley-Terry calibration that recovers agreement at substantially better data efficiency than from-scratch retraining. The same frozen reward drives consistent reward-axis gains across three downstream applications: inference-time best-of-N selection, DITTO-style latent optimization, and expert-iteration post-training. TuneJury is available at https://github.com/yonghyunk1m/TuneJury.
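As a rough illustration of two ideas named in the abstract, the sketch below shows how a scalar reward could be turned into a pairwise preference probability under a Bradley-Terry model, and how the same reward could drive inference-time best-of-N selection. It is a minimal sketch under stated assumptions, not the released TuneJury code: the names `preference_probability` and `best_of_n`, the `reward(prompt, clip)` callable, and the use of an unscaled logistic link on the score margin are all hypothetical choices made for illustration.

```python
import math
from typing import Callable, Sequence, TypeVar

Clip = TypeVar("Clip")


def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that clip A is preferred over clip B.

    Assumption: the preference probability is the logistic function of the
    raw score margin (no temperature or per-system offset).
    """
    margin = score_a - score_b
    # Numerically stable logistic: never exponentiate a large positive value.
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)


def best_of_n(
    prompt: str,
    candidates: Sequence[Clip],
    reward: Callable[[str, Clip], float],
) -> Clip:
    """Return the candidate with the highest reward for the given prompt.

    `reward` is a hypothetical stand-in for a frozen reward model that maps
    a (text prompt, audio clip) pair to a scalar preference score.
    """
    if not candidates:
        raise ValueError("best_of_n needs at least one candidate")
    return max(candidates, key=lambda clip: reward(prompt, clip))
```

In use, a generator would sample N clips for one prompt, and `best_of_n` would keep the clip the frozen reward scores highest. The pairwise probability is symmetric by construction, so `preference_probability(a, b) + preference_probability(b, a)` equals 1 up to floating-point error.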