Mean Mode Screaming: 1000-레이어 확산 트랜스포머를 위한 평균-분산 분할 잔차

초록

확산 트랜스포머(DiT)를 수백 개 층으로 확장하면 구조적 취약성이 발생한다. 즉, 네트워크가 조용하고 평균이 지배적인 붕괴 상태에 빠져 토큰 표현을 균질화하고 중심화된 변동을 억제할 수 있다. 기계적 감사(mechanistic auditing)를 통해 우리는 이 붕괴의 트리거 이벤트를 평균 모드 스크리밍(Mean Mode Screaming, MMS)으로 식별한다. MMS는 훈련이 안정적으로 보일 때도 발생할 수 있으며, 잔차 기록기(residual writers)에 대한 평균-일관 역전파 충격(mean-coherent backward shock)이 깊은 잔차 분기를 열어 네트워크를 평균이 지배적인 상태로 이끈다. 우리는 이 행동이 이러한 그래디언트를 평균-일관 성분과 중심화된 성분으로 정확히 분해함으로써 유발되며, 값이 균질화되면 소프트맥스 야코비안의 영공간(null space)을 통한 어텐션 로짓 그래디언트의 구조적 억제가 이를 더욱 악화시킨다는 것을 보여준다. 이 문제를 해결하기 위해 우리는 평균-분산 분할(MV-Split) 잔차를 제안한다. 이는 개별 게인 처리된 중심화된 잔차 업데이트와 누수 트렁크 평균 대체(leaky trunk-mean replacement)를 결합한 것이다. 400층 단일 스트림 DiT에서 MV-Split은 안정화되지 않은 기준선을 붕괴시키는 발산 붕괴를 방지한다. 이는 전체 일정에 걸쳐 LayerScale과 같은 토큰-등방성 게이팅 방법보다 훨씬 우수하면서도 기준선의 붕괴 전 궤적에 근접하게 추적한다. 마지막으로, 우리는 1000층 DiT를 경계 규모에서의 규모 검증 실행으로 제시하여, 극한 깊이에서도 아키텍처가 안정적으로 훈련 가능함을 입증한다.

English

Scaling Diffusion Transformers (DiTs) to hundreds of layers introduces a structural vulnerability: networks can enter a silent, mean-dominated collapse state that homogenizes token representations and suppresses centered variation. Through mechanistic auditing, we isolate the trigger event of this collapse as Mean Mode Screaming (MMS). MMS can occur even when training appears stable, with a mean-coherent backward shock on residual writers that opens deep residual branches and drives the network into a mean-dominated state. We show this behavior is driven by an exact decomposition of these gradients into mean-coherent and centered components, compounded by the structural suppression of attention-logit gradients through the null space of the Softmax Jacobian once values homogenize. To address this, we propose Mean-Variance Split (MV-Split) Residuals, which combine a separately gained centered residual update with a leaky trunk-mean replacement. On a 400-layer single-stream DiT, MV-Split prevents the divergent collapse that crashes the un-stabilized baseline; it tracks close to the baseline's pre-crash trajectory while remaining substantially better than token-isotropic gating methods such as LayerScale across the full schedule. Finally, we present a 1000-layer DiT as a scale-validation run at boundary scales, establishing that the architecture remains stably trainable at extreme depth.

Mean Mode Screaming: 1000-레이어 확산 트랜스포머를 위한 평균-분산 분할 잔차

Mean Mode Screaming: Mean--Variance Split Residuals for 1000-Layer Diffusion Transformers

초록

Support