Entropy as a Structural Prior: How a Log-Barrier on the DiT Belief Space Drives Musical Diversity and Development

Abstract

Confidence-based loss weighting is usually avoided in generative models because it accelerates error accumulation when the model is confidently wrong. We argue that this intuition does not hold in supervised diffusion training. We introduce the Eisbach log-barrier, a parameter-free weight derived from the entropy of the DiT output's spatial energy distribution: high entropy damps the gradient, while low entropy preserves it. Applied to LoRA fine-tuning of Stable Audio 3 Medium on MusicCaps, it unexpectedly yields stronger thematic development, clearer acoustic differentiation, and higher textural diversity than unweighted training, which is the opposite of mode collapse. Two factors explain this. First, in supervised diffusion the gradient direction is locked to the ground truth, so confidence only scales the step size. Second, temporal entropy downweights flat samples while preserving high-contrast ones. The result is an online, self-referential data curriculum that emerges purely from the forward pass. We also analyze its dynamics across noise levels and state testable predictions.
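
To make the weighting idea concrete, the following is a minimal sketch of an entropy-derived loss weight and should not be read as the Eisbach log-barrier itself, whose formula the abstract does not state. It assumes the model output is a list of time frames, that "energy" means the sum of squared values per frame, and that the weight takes the hypothetical form -log(H_norm) clipped to [0, 1]. The function names (`frame_energies`, `normalized_entropy`, `assumed_barrier_weight`) are illustrative only.

```python
import math


def frame_energies(output):
    """Sum of squared values per time frame; `output` is a list of frames."""
    return [sum(x * x for x in frame) for frame in output]


def normalized_entropy(energies, eps=1e-12):
    """Shannon entropy of the energy distribution, scaled to roughly [0, 1]."""
    if len(energies) < 2:
        raise ValueError("need at least two frames to define an entropy")
    total = sum(energies) + eps
    probs = [e / total for e in energies]
    # Terms with zero probability contribute nothing to the entropy.
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(energies))


def assumed_barrier_weight(h_norm, eps=1e-12):
    """Hypothetical weight (an assumption): -log(h_norm), clipped to [0, 1].

    High entropy (flat energy) gives a weight near 0, damping the gradient;
    low entropy (concentrated energy) gives a weight of 1, preserving it.
    """
    return min(1.0, max(0.0, -math.log(h_norm + eps)))


# A flat output versus one with a single high-energy frame.
flat = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
peaked = [[10.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

w_flat = assumed_barrier_weight(normalized_entropy(frame_energies(flat)))
w_peaked = assumed_barrier_weight(normalized_entropy(frame_energies(peaked)))
```

In a training loop, a weight of this kind would multiply the per-sample diffusion loss and be treated as a constant (no gradient through it), so it rescales the step size without changing the gradient direction set by the ground truth.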