Entropy as a Structural Prior: How a Log-Barrier in DiT Belief Space Drives Musical Diversity and Development

Abstract

Confidence-based loss weighting is usually avoided in generative models because it amplifies errors when the model is confidently wrong, but this intuition breaks down in supervised diffusion training. We introduce the Eisbach log-barrier, a parameter-free weight derived from the entropy of the DiT output's spatial energy distribution: high entropy damps the gradient, while low entropy preserves it. Applied to LoRA fine-tuning of Stable Audio 3 Medium on MusicCaps, it unexpectedly yields stronger thematic development, clearer acoustic differentiation, and higher textural diversity than unweighted training, which is the opposite of mode collapse. Two effects explain this. First, in supervised diffusion the gradient direction is fixed by the ground-truth target, so confidence only scales the step size. Second, temporal entropy downweights flat samples while preserving high-contrast ones. The result is an online, self-referential data curriculum that emerges purely from the forward pass. We analyze its noise-level dynamics and offer testable predictions.
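To make the weighting idea concrete, the following is a minimal NumPy sketch rather than the authors' implementation. The abstract does not give the exact form of the Eisbach log-barrier, so everything here is an assumption made for illustration: the function names `entropy_weight` and `weighted_mse`, the choice to take the energy distribution over the time axis, and the clipped weight `min(1, -log h)` applied to the normalized entropy `h`. The sketch only reproduces the qualitative behavior stated above: a flat (high-entropy) output is damped toward zero, and a high-contrast (low-entropy) output keeps its full weight.

```python
import numpy as np

EPS = 1e-12  # numerical floor; an illustrative choice, not from the source


def entropy_weight(pred):
    """Hypothetical per-sample weight from the entropy of the output energy.

    pred: array of shape (batch, channels, time), standing in for a DiT output.
    Returns an array of shape (batch,) with values in [0, 1].
    """
    # Energy at each time step, summed over channels.
    energy = np.sum(pred ** 2, axis=1)                       # (batch, time)
    # Normalize the energy into a probability distribution over time.
    p = energy / (np.sum(energy, axis=1, keepdims=True) + EPS)
    # Shannon entropy, divided by its maximum log(T) so that h lies in [0, 1].
    h = -np.sum(p * np.log(p + EPS), axis=1) / np.log(p.shape[1])
    h = np.clip(h, EPS, 1.0)
    # Assumed log-barrier form: tends to 0 as h -> 1, capped at 1 for low h.
    return np.clip(-np.log(h), 0.0, 1.0)


def weighted_mse(pred, target):
    """Per-sample MSE scaled by the entropy weight, averaged over the batch.

    In an autodiff framework the weight would be treated as a constant
    (an assumption here), so it rescales each sample's gradient step
    without changing the direction set by the target.
    """
    per_sample = np.mean((pred - target) ** 2, axis=(1, 2))  # (batch,)
    return float(np.mean(entropy_weight(pred) * per_sample))


# Two toy samples: one with flat energy over time, one with a single spike.
flat = np.ones((2, 64))
spike = np.zeros((2, 64))
spike[:, 10] = 1.0
batch = np.stack([flat, spike])                              # (2, 2, 64)

weights = entropy_weight(batch)
loss = weighted_mse(batch, np.zeros_like(batch))
```

Under these assumptions the flat sample contributes almost nothing to the loss while the spiked sample contributes its full error, which is the sample-level selection effect that the abstract describes as a curriculum emerging from the forward pass.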