Residual Stream Duality in Modern Transformer Architectures
March 17, 2026
Author: Yifan Zhang
cs.AI
Abstract
Recent work has made clear that the residual pathway is not mere optimization plumbing; it is part of the model's representational machinery. We agree, but argue that the cleanest way to organize this design space is through a two-axis view of the Transformer. A decoder evolves information along two ordered dimensions: sequence position and layer depth. Self-attention already provides adaptive mixing along the sequence axis, whereas the residual stream usually performs fixed addition along the depth axis. If we fix a token position and treat layer index as the ordered variable, then a causal depth-wise residual attention read is exactly the same local operator as causal short sliding-window attention (ShortSWA), except written over depth rather than over sequence. This is the core residual stream duality behind Transformer^2. This perspective also clarifies the recent literature. ELC-BERT and DenseFormer already show that learned aggregation over depth can outperform uniform residual accumulation, while Vertical Attention, DeepCrossAttention (DCA), MUDDFormer, and Attention Residuals move further toward explicit attention-based routing over earlier layers. The key point, however, is that operator-level duality does not imply systems-level symmetry. For large-scale autoregressive models, sequence-axis ShortSWA is usually the more hardware-friendly placement because it reuses token-side sliding-window kernels, KV-cache layouts, and chunked execution. If the goal is instead to change the shortcut itself, Deep Delta Learning (DDL) is the cleaner intervention because it modifies the residual operator directly rather than adding a separate cross-layer retrieval path. Our recommendation is therefore simple: use DDL when the shortcut is the object of interest, and use sequence-axis ShortSWA when the goal is local adaptive mixing.
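The operator-level duality can be made concrete with a small sketch: the same causal short sliding-window attention routine, applied once with token position as the ordered axis and once with layer index as the ordered axis at a fixed token. This is an illustrative toy, not the paper's parameterization; the function name `causal_swa`, the shared q = k = v states, and the window size are assumptions chosen for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_swa(q, k, v, window):
    """Causal short sliding-window attention over one ordered axis.

    q, k, v: (T, d) arrays indexed along the ordered axis (token
    position in the usual placement, layer depth in the dual one).
    Position t attends only to positions in [t - window + 1, t].
    """
    T, d = q.shape
    out = np.empty_like(v)
    for t in range(T):
        lo = max(0, t - window + 1)
        scores = q[t] @ k[lo:t + 1].T / np.sqrt(d)
        out[t] = softmax(scores) @ v[lo:t + 1]
    return out

rng = np.random.default_rng(0)
d, T, L, W = 8, 16, 12, 4

# Sequence axis: one layer's hidden states across T token positions.
seq_states = rng.normal(size=(T, d))
seq_mix = causal_swa(seq_states, seq_states, seq_states, W)

# Depth axis: one token's hidden states across L layers. The identical
# operator, with layer index as the ordered variable, acts as a
# depth-wise residual attention read over the last W layers.
depth_states = rng.normal(size=(L, d))
depth_read = causal_swa(depth_states, depth_states, depth_states, W)
```

Note that position 0 sees only itself, so its output equals its own value vector on both axes; the two placements differ only in which ordered index the window slides over, which is the duality the abstract describes.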