Mean Mode Screaming: Mean-Variance Split Residuals for 1000-Layer Diffusion Transformers

Abstract

Scaling Diffusion Transformers (DiTs) to hundreds of layers introduces a structural vulnerability: networks can enter a silent, mean-dominated collapse state that homogenizes token representations and suppresses centered variation. Through mechanistic auditing, we isolate the trigger event of this collapse, which we call Mean Mode Screaming (MMS). MMS can occur even when training appears stable: a mean-coherent backward shock on residual writers opens deep residual branches and drives the network into a mean-dominated state. We explain this behavior with an exact decomposition of these gradients into mean-coherent and centered components, and show that it is compounded by the structural suppression of attention-logit gradients through the null space of the softmax Jacobian once values homogenize. To address this, we propose Mean-Variance Split (MV-Split) Residuals, which combine a separately gained centered residual update with a leaky trunk-mean replacement. On a 400-layer single-stream DiT, MV-Split prevents the divergent collapse that crashes the unstabilized baseline; it closely tracks the baseline's pre-crash trajectory while remaining substantially better than token-isotropic gating methods such as LayerScale across the full schedule. Finally, we present a 1000-layer DiT as a scale-validation run at boundary scales, establishing that the architecture remains stably trainable at extreme depth.
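
Two linear-algebra facts behind the abstract's argument can be checked numerically. The following is a minimal NumPy sketch, not the paper's code: the array names, shapes, and random data are illustrative assumptions, and it covers only a single attention query. It shows (1) that a token-wise gradient splits exactly and orthogonally into a mean-coherent part (the same row for every token) and a centered part, and (2) that when all value vectors are identical, the gradient reaching the attention logits lies in the null space of the softmax Jacobian and therefore vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 8, 4  # illustrative sizes (assumption)

# (1) Exact mean-coherent / centered split of a token-wise gradient.
G = rng.normal(size=(n_tokens, d))  # hypothetical gradient, one row per token
G_mean = np.broadcast_to(G.mean(axis=0, keepdims=True), G.shape)  # shared row
G_centered = G - G_mean             # rows sum to zero over tokens

# The two parts are orthogonal, so their squared norms add up.
cross_term = float(np.sum(G_mean * G_centered))
energy_gap = float(np.sum(G**2) - np.sum(G_mean**2) - np.sum(G_centered**2))


def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(scores - scores.max())
    return e / e.sum()


def logit_grad(scores, V, g_out):
    """Gradient of the loss w.r.t. one query's attention logits.

    Forward pass: out = softmax(scores) @ V, with g_out = dL/d(out).
    """
    p = softmax(scores)
    J = np.diag(p) - np.outer(p, p)  # softmax Jacobian (symmetric)
    return J @ (V @ g_out)


# (2) Logit gradients with generic versus homogenized values.
scores = rng.normal(size=n_tokens)
g_out = rng.normal(size=d)
V_generic = rng.normal(size=(n_tokens, d))
V_homog = np.tile(rng.normal(size=d), (n_tokens, 1))  # every token has the same value

grad_generic = logit_grad(scores, V_generic, g_out)
grad_homog = logit_grad(scores, V_homog, g_out)

print("cross term:", cross_term)
print("generic logit-grad norm:", np.linalg.norm(grad_generic))
print("homogenized logit-grad norm:", np.linalg.norm(grad_homog))
```

With homogenized values, `V @ g_out` is a constant vector, and the softmax Jacobian maps constant vectors to zero because the attention weights sum to one. The logits then receive no gradient no matter what the loss is.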
