Improving Text-to-Music Generation with Human Preference Rewards

Abstract

We describe our entry to the efficiency track of the Academic Text-to-Music (ATTM) Grand Challenge at ICME 2026. Beyond the FAD-CLAP and CLAP scores specified by the challenge protocol, we add a learned human-preference reward from TuneJury, a twin pairwise ranker trained on open music-preference datasets. The reward serves both as a training-time conditioning signal and as a sample-selection criterion. The pipeline combines five engineering decisions on a 120M-parameter FluxAudio-S backbone, four at training time and one at inference: (i) training-time reward conditioning that doubles as an inference-time CFG axis, (ii) a sweep over five score-conditioning architectures, in which training and inference use different variants, (iii) expert iteration on the top decile, (iv) a short preference-tuning pass (CRPO) for audio-text alignment, and (v) inference post-processing via joint CFG, source separation, and loudness normalization. A per-stage decomposition on 100 Song Describer prompts shows that training-time reward conditioning acts as a functional conditioning axis, that expert iteration is the dominant contributor, that the preference-tuning pass adds only noise-level gains, and that the inference-time score scalar is already saturated by the end of the chain.
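As a rough illustration of step (iii), the selection half of expert iteration on the top decile might look like the sketch below. It assumes each generated sample already carries a scalar preference reward (for example, one derived from TuneJury); the function name `select_top_decile`, the `fraction` parameter, and the clip identifiers are hypothetical, and the abstract does not specify how the selection is actually implemented.

```python
from typing import List, Sequence, Tuple


def select_top_decile(
    samples: Sequence[str],
    rewards: Sequence[float],
    fraction: float = 0.1,
) -> List[Tuple[str, float]]:
    """Return the highest-reward `fraction` of samples, best first.

    Hypothetical helper: `samples` are identifiers of generated clips and
    `rewards` are their scalar preference scores (assumed precomputed).
    """
    if len(samples) != len(rewards):
        raise ValueError("samples and rewards must have the same length")
    if not samples:
        return []
    # Keep at least one sample so that small batches still yield training data.
    keep = max(1, int(len(samples) * fraction))
    ranked = sorted(zip(samples, rewards), key=lambda pair: pair[1], reverse=True)
    return ranked[:keep]


# Usage: 20 hypothetical clips with increasing rewards; the top decile is 2 clips.
clips = [f"clip_{i}" for i in range(20)]
scores = [i / 20 for i in range(20)]
selected = select_top_decile(clips, scores)
```

In an expert-iteration loop, the selected clips would then be fed back as fine-tuning data for the next round; that retraining step is outside the scope of this sketch.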