How Far Can Chord-Symbol Time-Series Adaptation Carry Genre Characteristics? Capabilities and Limits in Multi-Genre Chord-Symbol Modeling

Abstract

Harmony is a compact symbolic layer where mathematical pitch relations, acoustic consonance, and musical convention meet. This report treats chord-symbol sequences not as a complete representation of music but as an interpretable, controllable time series for genre-local harmonic modeling. Starting from a frozen pop-jazz Music Transformer checkpoint, I evaluate how far small adaptation interfaces can extend the model to eleven target genres: blues, bossa nova, Bach chorales, country, electronic, folk, funk, gospel, hip-hop, R&B/soul, and rock. The main evaluation compares five methods (LoRA, IA3, BitFit, prefix tuning, and full fine-tuning) across 11 genres and 3 seeds, a complete 5 × 11 × 3 = 165-cell grid. All five methods improve over the frozen base on held-out chord prediction, with macro gains ranging from +2.89 to +3.61 points. LoRA and IA3 score highest, but Wilcoxon tests with Holm and Benjamini-Hochberg correction do not support a decisive winner. A matched-data-size control sharpens this picture: when every genre is sub-sampled to a common corpus size, IA3 stays on top, while LoRA's full-data edge disappears and it falls to last place, indicating that the small gaps between methods are partly driven by data size. A control-token baseline is also strong, and wrong-genre adapters often beat the frozen base, which suggests that much of the effect comes from lightweight conditioning over a reusable harmonic base rather than from any one adapter family. Additional diagnostics (rank sweeps, wrong-genre rotation, a base-checkpoint ablation, chord-only genre classification, generated-output statistics, real-song evaluation, and duplicate analysis) support a bounded conclusion: chord-symbol adaptation reliably improves genre-local harmonic prediction, but chord symbols alone do not carry complete genre identity. The report therefore makes no claims about perceived genre authenticity or full musical quality, both of which require controlled listener or musician evaluation.
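As a rough illustration of the multiple-comparison step mentioned above, the sketch below counts the cells of the evaluation grid and applies Holm and Benjamini-Hochberg adjustments to a list of pairwise p-values. It assumes all pairwise method comparisons are tested, which the abstract does not state, and the p-values are made-up placeholders, not results from this report. The function names `holm` and `benjamini_hochberg` are my own; the Wilcoxon test itself is left out, since it would normally come from a statistics library.

```python
from itertools import combinations

# Methods and grid dimensions as stated in the abstract.
METHODS = ["LoRA", "IA3", "BitFit", "prefix tuning", "full fine-tuning"]
N_GENRES, N_SEEDS = 11, 3
n_cells = len(METHODS) * N_GENRES * N_SEEDS

# Assumption: every unordered pair of methods is compared once.
pairs = list(combinations(METHODS, 2))


def holm(pvals):
    """Holm step-down adjusted p-values (controls the family-wise error rate)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Multiply the rank-th smallest p-value by (m - rank), cap at 1,
        # and enforce monotonicity with a running maximum.
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up adjusted p-values (controls the FDR)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        # Scale by m / (rank + 1) and enforce monotonicity from the top down.
        running_min = min(running_min, pvals[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted


# Placeholder p-values, one per pair; NOT taken from the report.
placeholder_p = [0.004, 0.02, 0.03, 0.04, 0.2, 0.3, 0.45, 0.6, 0.8, 0.9]
holm_adj = holm(placeholder_p)
bh_adj = benjamini_hochberg(placeholder_p)

if __name__ == "__main__":
    print(f"grid cells: {n_cells}, pairwise comparisons: {len(pairs)}")
    for (a, b), p, h, q in zip(pairs, placeholder_p, holm_adj, bh_adj):
        print(f"{a} vs {b}: raw={p:.3f} holm={h:.3f} bh={q:.3f}")
```

Holm controls the family-wise error rate and is the more conservative of the two; Benjamini-Hochberg controls the false discovery rate. Reporting both, as the abstract does, shows whether the "no decisive winner" reading depends on which correction is used.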