Entropy as a Structural Prior: How a Log-Barrier on the DiT Belief Space Drives Musical Diversity and Development

Abstract

Confidence-based loss weighting is usually avoided in generative models because it reinforces errors whenever the model is confidently wrong. We argue that this intuition does not hold in supervised diffusion training. We introduce the Eisbach log-barrier, a parameter-free weight derived from the entropy of the energy distribution over the DiT's output space: high entropy damps the gradient, while low entropy preserves it. Applied to LoRA fine-tuning of Stable Audio 3 Medium on MusicCaps, it unexpectedly yields stronger thematic development, clearer acoustic differentiation, and higher textural diversity than unweighted training, which is the opposite of mode collapse. Two properties explain this. First, in supervised diffusion the gradient direction is locked to the ground truth, so confidence only scales the step size. Second, temporal entropy downweights flat samples while preserving high-contrast ones. The result is an online, self-referential data curriculum that emerges purely from the forward pass. We also analyze its dynamics across noise levels and state testable predictions.
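
To make the weighting idea concrete, the following is a minimal NumPy sketch of one way an entropy-derived, parameter-free weight could be computed. The abstract does not state the formula, so everything specific here is an assumption made for illustration: the function names `entropy_barrier_weight` and `weighted_mse`, the choice to measure energy per time frame, the normalization by the maximum entropy, and the capped `-log` form of the barrier. It should not be read as the definition of the Eisbach log-barrier.

```python
import numpy as np

def entropy_barrier_weight(pred, eps=1e-8):
    """Per-sample weight from the entropy of the output's energy over time.

    pred: array of shape (batch, channels, frames) with frames > 1,
    standing in for a DiT output. The capped -log form below is an
    assumption for illustration, not the paper's definition.
    """
    energy = np.square(pred).sum(axis=1)                      # (batch, frames)
    p = energy / (energy.sum(axis=1, keepdims=True) + eps)    # energy as a distribution
    entropy = -(p * np.log(p + eps)).sum(axis=1)              # Shannon entropy per sample
    h = np.clip(entropy / np.log(pred.shape[-1]), eps, 1.0)   # normalized to (0, 1]
    # Weight tends to 0 as h approaches 1 (flat energy) and is capped at 1
    # for low entropy, so concentrated samples keep their full gradient.
    return np.minimum(1.0, -np.log(h))

def weighted_mse(pred, target):
    """Per-sample MSE scaled by the weight, which is treated as a constant."""
    per_sample = np.square(pred - target).mean(axis=(1, 2))
    return (entropy_barrier_weight(pred) * per_sample).mean()

# Hypothetical inputs: one flat sample and one with all energy in a single frame.
flat = np.ones((2, 8))
peaked = np.zeros((2, 8))
peaked[:, 3] = 1.0
weights = entropy_barrier_weight(np.stack([flat, peaked]))
print(weights)
```

Under these assumptions the flat sample receives a weight near 0 and the peaked sample a weight of 1. Because the weight multiplies the loss without changing the regression target, it only rescales each sample's step size, which is the property the abstract relies on. In an autodiff framework the weight would need to be detached from the graph to preserve that behavior.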