Mean Mode Screaming: Mean-Variance Split Residuals for 1000-Layer Diffusion Transformers
May 7, 2026
Author: Pengqi Lu
cs.AI
Abstract
Scaling Diffusion Transformers (DiTs) to hundreds of layers introduces a structural vulnerability: networks can enter a silent, mean-dominated collapse state that homogenizes token representations and suppresses centered variation. Through mechanistic auditing, we isolate the trigger event of this collapse, which we call Mean Mode Screaming (MMS). MMS can occur even when training appears stable: a mean-coherent backward shock to the residual writers opens deep residual branches and drives the network into the mean-dominated state. We explain this behavior via an exact decomposition of these gradients into mean-coherent and centered components, and show that it is compounded by the structural suppression of attention-logit gradients through the null space of the softmax Jacobian once values homogenize.
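To make the two mechanisms concrete, the following PyTorch sketch (ours, not code from the paper) verifies the exact mean/centered gradient split and the constant-direction null space of the softmax Jacobian; all variable names are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
d = 8

# Exact split of a gradient vector g into a mean-coherent component
# (along the all-ones direction) and a centered remainder.
g = torch.randn(d)
ones = torch.ones(d)
g_mean = g.mean() * ones      # mean-coherent component
g_centered = g - g_mean       # centered component
assert torch.allclose(g_mean + g_centered, g)  # the split is exact
assert torch.isclose(g_mean @ g_centered, torch.tensor(0.0), atol=1e-5)  # and orthogonal

# The softmax Jacobian J = diag(p) - p p^T annihilates constant vectors:
# J @ 1 = p - p * (1^T p) = 0.  Once attention values homogenize, the
# upstream gradient dL/dp becomes (nearly) constant across keys, so the
# logit gradient J^T (dL/dp) is structurally suppressed.
z = torch.randn(d)
p = torch.softmax(z, dim=0)
J = torch.diag(p) - torch.outer(p, p)
homogenized_grad = 3.0 * ones          # stand-in for a constant dL/dp
print(J @ homogenized_grad)            # ~0: attention-logit gradient vanishes
```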
To address this, we propose Mean-Variance Split (MV-Split) Residuals, which combine a centered residual update carrying its own gain with a leaky replacement of the trunk mean. On a 400-layer single-stream DiT, MV-Split prevents the divergent collapse that crashes the unstabilized baseline; it closely tracks the baseline's pre-crash trajectory while substantially outperforming token-isotropic gating methods such as LayerScale across the full training schedule. Finally, we present a 1000-layer DiT as a boundary-scale validation run, establishing that the architecture remains stably trainable at extreme depth.
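The abstract does not spell out the MV-Split update rule, but a minimal sketch consistent with its description might look as follows. Everything here is an assumption: the class name MVSplitResidual, the per-channel gain gamma, the leak rate beta, and centering over the feature dimension are illustrative choices, not the paper's definition.

```python
import torch
import torch.nn as nn

class MVSplitResidual(nn.Module):
    """Hypothetical sketch of an MV-Split residual connection.

    The centered part of the branch output is added with its own learned
    gain, while the trunk mean is updated by a leaky (convex) replacement
    instead of unbounded accumulation. Parameter names and feature-dim
    centering are assumptions, not the paper's exact formulation.
    """

    def __init__(self, dim: int, beta: float = 0.1):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))  # separate gain on the centered update
        self.beta = beta                            # leak rate for the trunk-mean replacement

    def forward(self, x: torch.Tensor, branch_out: torch.Tensor) -> torch.Tensor:
        # Split trunk and branch into per-token mean and centered parts.
        x_mean = x.mean(dim=-1, keepdim=True)
        b_mean = branch_out.mean(dim=-1, keepdim=True)
        x_centered = x - x_mean
        b_centered = branch_out - b_mean

        # Centered path: an ordinary residual add with its own gain.
        centered = x_centered + self.gamma * b_centered
        # Mean path: leaky replacement rather than accumulation.
        mean = (1.0 - self.beta) * x_mean + self.beta * b_mean
        return centered + mean
```

The appeal of a leaky replacement, under this reading, is that the trunk mean becomes a convex combination that stays bounded with depth, while the centered path preserves the usual residual route for token-distinguishing signal.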