Learning to Skip the Middle Layers of Transformers

June 26, 2025
Authors: Tim Lawson, Laurence Aitchison
cs.AI

Abstract

Conditional computation is a popular strategy to make Transformers more efficient. Existing methods often target individual modules (e.g., mixture-of-experts layers) or skip layers independently of one another. However, interpretability research has demonstrated that the middle layers of Transformers exhibit greater redundancy, and that early layers aggregate information into token positions. Guided by these insights, we propose a novel architecture that dynamically skips a variable number of layers from the middle outward. In particular, a learned gating mechanism determines whether to bypass a symmetric span of central blocks based on the input, and a gated attention mechanism prevents subsequent tokens from attending to skipped token positions. Residual norms are controlled with a 'sandwich' or 'perilayernorm' scheme, and gate sparsity with an adaptive regularization loss. We aimed to reduce compute requirements for 'simpler' tokens and potentially foster an emergent multi-level representational hierarchy, but, at the scales investigated, our approach does not achieve improvements in the trade-off between validation cross-entropy and estimated FLOPs compared to dense baselines with fewer layers. We release our code at https://github.com/tim-lawson/skip-middle.
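To make the middle-out skipping idea concrete, below is a minimal PyTorch-style sketch of how nested, input-dependent gates over a symmetric span of central blocks could look. This is not the authors' implementation (their code is at the repository above): the names `SkipMiddle`, `gate`, and `gate_sparsity_loss` are hypothetical, standard `nn.TransformerEncoderLayer` blocks stand in for the paper's blocks, the gated attention mechanism that masks skipped token positions is omitted, and the sparsity regularizer shown is a simple stand-in for the paper's adaptive loss.

```python
import torch
import torch.nn as nn


class SkipMiddle(nn.Module):
    """Sketch: skip a variable number of layers from the middle outward."""

    def __init__(self, d_model: int, n_heads: int, n_layers: int):
        super().__init__()
        assert n_layers >= 4 and n_layers % 2 == 0, "pair layers around the middle"
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        # One gate logit per inner nesting level; the outermost pair of
        # layers always runs, so there are n_layers // 2 - 1 gated levels.
        self.gate = nn.Linear(d_model, n_layers // 2 - 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Per-token gates in [0, 1], computed once from the block input.
        # cumprod nests the spans: closing level d also closes every deeper
        # (more central) level, so the skipped span grows middle-outward.
        g = torch.cumprod(torch.sigmoid(self.gate(x)), dim=-1)  # (B, T, levels)
        n = len(self.blocks)
        for i, block in enumerate(self.blocks):
            d = min(i, n - 1 - i)  # distance of layer i from the nearest edge
            if d == 0:
                x = block(x)  # outermost layers are never skipped
            else:
                gd = g[..., d - 1].unsqueeze(-1)  # (B, T, 1)
                # Soft bypass: when gd is near 0, the residual stream
                # passes through this block unchanged.
                x = x + gd * (block(x) - x)
        return x


def gate_sparsity_loss(g: torch.Tensor, target_rate: float = 0.5) -> torch.Tensor:
    # Stand-in for the paper's adaptive regularization loss: push the mean
    # gate openness toward a target execution rate (target_rate is made up).
    return (g.mean() - target_rate).pow(2)
```

The soft sigmoid gates keep the skipping decision differentiable during training; at inference, such gates would typically be thresholded to hard 0/1 decisions so that bypassed blocks can actually be left uncomputed and save FLOPs.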