Residual Stream Duality in Modern Transformer Architectures

Abstract

Recent work has made clear that the residual pathway is not mere optimization plumbing; it is part of the model's representational machinery. We agree, but argue that the cleanest way to organize this design space is through a two-axis view of the Transformer. A decoder evolves information along two ordered dimensions: sequence position and layer depth. Self-attention already provides adaptive mixing along the sequence axis, whereas the residual stream usually performs fixed addition along the depth axis. If we fix a token position and treat the layer index as the ordered variable, then a causal depth-wise residual attention read is exactly the same local operator as causal short sliding-window attention (ShortSWA), except that it is written over depth rather than over sequence. This is the core residual stream duality behind Transformer^2. This perspective also clarifies the recent literature. ELC-BERT and DenseFormer already show that learned aggregation over depth can outperform uniform residual accumulation, while Vertical Attention, DeepCrossAttention (DCA), MUDDFormer, and Attention Residuals move further toward explicit attention-based routing over earlier layers. The key point, however, is that operator-level duality does not imply systems-level symmetry. For large-scale autoregressive models, sequence-axis ShortSWA is usually the more hardware-friendly placement because it reuses token-side sliding-window kernels, KV-cache layouts, and chunked execution. If the goal is instead to change the shortcut itself, Deep Delta Learning (DDL) is the cleaner intervention because it modifies the residual operator directly rather than adding a separate cross-layer retrieval path. Our recommendation is therefore simple: use DDL when the shortcut is the object of interest, and use sequence-axis ShortSWA when the goal is local adaptive mixing.
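The operator-level duality can be illustrated with a minimal sketch. It assumes a toy setting in which each hidden vector serves as its own query, key, and value (no learned projections), and all function names are hypothetical rather than taken from Transformer^2 or any of the cited papers. The same causal sliding-window read is applied either along the sequence axis (fixed layer) or along the depth axis (fixed token); swapping the two axes of the state grid turns one read into the other.

```python
import math


def causal_window_attention(items, window):
    """Causal sliding-window softmax read along one ordered axis.

    items: list of equal-length float vectors, ordered along the axis.
    Assumption: each vector acts as its own query, key, and value.
    Position i attends to positions max(0, i - window + 1) .. i.
    """
    dim = len(items[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for i, query in enumerate(items):
        start = max(0, i - window + 1)
        visible = items[start:i + 1]
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in visible]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * value[c] for w, value in zip(weights, visible)) / total
            for c in range(dim)
        ])
    return outputs


def sequence_axis_read(states, layer, window):
    """Fix a layer; the ordered variable is token position (ShortSWA-style)."""
    return causal_window_attention(states[layer], window)


def depth_axis_read(states, token, window):
    """Fix a token; the ordered variable is layer index (depth-wise read)."""
    column = [layer_states[token] for layer_states in states]
    return causal_window_attention(column, window)


def swap_axes(states):
    """Turn states[layer][token] into states[token][layer]."""
    return [list(column) for column in zip(*states)]
```

In this sketch, `depth_axis_read(states, t, w)` and `sequence_axis_read(swap_axes(states), t, w)` run the identical computation, which is the operator-level statement. The sketch says nothing about the systems-level asymmetry: in a real decoder the depth-axis read would need access to earlier layers' activations, whereas the sequence-axis read can reuse existing token-side kernels and KV-cache layouts.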
