Entropy as a Structural Prior: How a Log-Barrier on the DiT Belief Space Promotes Musical Diversity and Development

Abstract

Confidence-based loss weighting is usually avoided in generative models because it reinforces errors when the model is confidently wrong, but this intuition breaks down in supervised diffusion training. We introduce the Eisbach log-barrier, a parameter-free loss weight derived from the entropy of the spatial energy distribution of the DiT output: high entropy damps the gradient, while low entropy preserves it. Applied to LoRA fine-tuning of Stable Audio 3 Medium on MusicCaps, the weighting unexpectedly yields stronger thematic development, clearer acoustic differentiation, and higher textural diversity than unweighted training, which is the opposite of mode collapse. Two factors explain this. First, in supervised diffusion training the gradient direction is fixed by the ground truth, so confidence only scales the step size. Second, temporal entropy downweights flat samples while preserving high-contrast ones. The result is an online, self-referential data curriculum that arises purely from the forward pass. We analyze its noise-level dynamics and state testable predictions.
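The weighting idea can be illustrated with a minimal sketch in plain Python. The abstract does not state the exact Eisbach formula, so everything below is an assumption made for illustration: the function names are hypothetical, the energy at each position is taken to be the sum of squared channel values, the entropy is normalized by the log of the number of positions, and the weight is a clipped `-log(h)` standing in for the actual log-barrier. The weight is computed from the prediction alone and treated as a constant, which is one reading of the claim that confidence only scales the step size.

```python
import math


def normalized_energy_entropy(pred):
    """Entropy of the per-position energy distribution, scaled to [0, 1].

    pred: one sample, given as a list of positions, each a list of channel values.
    Assumption: energy at a position is the sum of squared channel values.
    """
    energy = [sum(c * c for c in frame) for frame in pred]
    total = sum(energy)
    n = len(energy)
    if total == 0.0 or n < 2:
        return 1.0  # nothing to measure: treat the sample as maximally flat
    probs = [e / total for e in energy]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(n)


def barrier_weight(h):
    """Hypothetical stand-in for the log-barrier, not the paper's formula.

    Returns 0 for a flat sample (h = 1) and is capped at 1 for peaked samples.
    """
    if h <= 0.0:
        return 1.0
    return max(0.0, min(1.0, -math.log(h)))


def weighted_mse(pred, target):
    """Mean squared error scaled by the entropy-derived weight.

    The weight is a plain number here. In an autodiff framework it would be
    excluded from differentiation, so it rescales the gradient of the MSE
    without changing its direction.
    """
    w = barrier_weight(normalized_energy_entropy(pred))
    count = sum(len(frame) for frame in pred)
    mse = sum(
        (p - t) ** 2
        for frame_p, frame_t in zip(pred, target)
        for p, t in zip(frame_p, frame_t)
    ) / count
    return w * mse, w
```

Under these assumptions, a prediction whose energy is spread evenly across positions receives a weight near zero, while a prediction that concentrates its energy in a few positions keeps a weight near one. This matches the damping and preserving behavior the abstract describes.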