Mean Mode Screaming: Mean-Variance Split Residuals for 1000-Layer Diffusion Transformers

May 7, 2026
Author: Pengqi Lu
cs.AI

Abstract

Scaling Diffusion Transformers (DiTs) to hundreds of layers introduces a structural vulnerability: networks can enter a silent, mean-dominated collapse state that homogenizes token representations and suppresses centered variation. Through mechanistic auditing, we isolate the trigger event of this collapse as Mean Mode Screaming (MMS). MMS can occur even when training appears stable: a mean-coherent backward shock hits the residual writers, opening deep residual branches and driving the network into a mean-dominated state. We show that this behavior is driven by an exact decomposition of the residual-writer gradients into mean-coherent and centered components, compounded by the structural suppression of attention-logit gradients through the null space of the Softmax Jacobian once values homogenize. To address this, we propose Mean-Variance Split (MV-Split) Residuals, which combine a separately gained centered residual update with a leaky trunk-mean replacement. On a 400-layer single-stream DiT, MV-Split prevents the divergent collapse that crashes the unstabilized baseline; it closely tracks the baseline's pre-crash trajectory while remaining substantially better than token-isotropic gating methods such as LayerScale across the full schedule. Finally, we present a 1000-layer DiT as a scale-validation run at boundary scales, establishing that the architecture remains stably trainable at extreme depth.
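
The abstract refers to, but does not reproduce, the mean/centered gradient decomposition and the Softmax Jacobian null-space argument. A sketch of the standard identities it appears to invoke, written in our own notation rather than the paper's, is:

```latex
\[
\bar{g} \;=\; \frac{1}{d}\sum_{i=1}^{d} g_i ,
\qquad
g \;=\; \underbrace{\bar{g}\,\mathbf{1}}_{\text{mean-coherent}}
\;+\;
\underbrace{\bigl(g - \bar{g}\,\mathbf{1}\bigr)}_{\text{centered}} ,
\qquad g \in \mathbb{R}^{d}.
\]

\[
p = \operatorname{softmax}(z),
\qquad
\frac{\partial p}{\partial z} \;=\; \operatorname{diag}(p) - p\,p^{\top},
\qquad
\bigl(\operatorname{diag}(p) - p\,p^{\top}\bigr)\,\mathbf{1} \;=\; p - p \;=\; 0 .
\]
```

Here g is the gradient written back to a residual writer for one token, split exactly into a constant (mean-coherent) direction and a zero-mean (centered) remainder. On the attention path, once the value vectors homogenize the attention output no longer depends on the weights p, so the gradient reaching p is constant across keys; since the (symmetric) Softmax Jacobian annihilates constant vectors, the backpropagated logit gradient vanishes, which is the structural suppression the abstract describes.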
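The abstract names only the two ingredients of an MV-Split residual: a separately gained centered update and a leaky replacement of the trunk mean. A minimal PyTorch-style sketch of one plausible reading, assuming the split is taken over the channel dimension, that "separately gained" means a learnable per-channel gain on the centered branch, and that the leaky replacement is a fixed-rate interpolation of the trunk mean toward the branch mean, could look like the following; the class name, the leak parameter, and the exact mixing form are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MVSplitResidual(nn.Module):
    """Illustrative sketch of a Mean-Variance Split (MV-Split) residual connection.

    The branch output is decomposed into its per-token channel mean and its
    centered (zero-mean) remainder. The centered part is added to the trunk
    with its own learnable gain, while the trunk mean is leakily replaced by
    the branch mean instead of being accumulated additively across depth.
    """

    def __init__(self, dim: int, leak: float = 0.1):
        super().__init__()
        # Learnable per-channel gain on the centered residual update (assumed form).
        self.centered_gain = nn.Parameter(torch.ones(dim))
        # Fixed leak rate controlling how strongly the branch mean replaces the trunk mean.
        self.leak = leak

    def forward(self, trunk: torch.Tensor, branch: torch.Tensor) -> torch.Tensor:
        # Split both streams over the channel (last) dimension into mean + centered parts.
        trunk_mean = trunk.mean(dim=-1, keepdim=True)
        branch_mean = branch.mean(dim=-1, keepdim=True)

        # Centered residual update: an ordinary additive residual with its own gain.
        centered = (trunk - trunk_mean) + self.centered_gain * (branch - branch_mean)

        # Leaky trunk-mean replacement: interpolate rather than accumulate, so
        # mean-coherent writes cannot compound without bound over hundreds of layers.
        mean = (1.0 - self.leak) * trunk_mean + self.leak * branch_mean

        return centered + mean
```

Used in place of the usual `x = x + block(x)` update, this would read `x = mv_split(x, block(x))` at each layer, with the leak rate setting how quickly the trunk's mean component forgets earlier mean-coherent writes.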