Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers

October 17, 2024
Authors: Shwai He, Tao Ge, Guoheng Sun, Bowei Tian, Xiaoyang Wang, Ang Li, Dong Yu
cs.AI

Abstract

Traditional transformer models often allocate a fixed amount of computational resources to every input token, leading to inefficient and unnecessary computation. To address this, the Mixture of Depths (MoD) was introduced to dynamically adjust the computational depth by skipping less important layers. Despite its promise, current MoD approaches remain under-explored and face two main challenges: (1) high training costs due to the need to train the entire model along with the routers that determine which layers to skip, and (2) the risk of performance degradation when important layers are bypassed. In response to the first issue, we propose Router-Tuning, a method that fine-tunes only the router on a small dataset, drastically reducing the computational overhead associated with full model training. For the second challenge, we propose MindSkip, which deploys Attention with Dynamic Depths. This method preserves the model's performance while significantly enhancing computational and memory efficiency. Extensive experiments demonstrate that our approach delivers competitive results while dramatically improving computational efficiency, e.g., a 21% speedup with only a 0.2% performance drop. The code is released at https://github.com/CASE-Lab-UMD/Router-Tuning.
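To make the idea concrete, below is a minimal PyTorch sketch of how a dynamic-depth attention layer of this kind could be wired up: a tiny router is attached to a frozen attention sub-layer, only the router is trainable (the Router-Tuning idea), and at inference the attention computation is skipped when the router's gate is low (the MindSkip idea). The class names, the mean-pooled router input, the sigmoid gate, and the 0.5 skip threshold are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

import torch
import torch.nn as nn


class MindSkipAttention(nn.Module):
    """Illustrative wrapper: a frozen attention sub-layer plus a tiny trainable router."""

    def __init__(self, attention: nn.Module, hidden_size: int):
        super().__init__()
        self.attention = attention
        # Router-Tuning: the backbone stays frozen; only the router below is updated.
        for p in self.attention.parameters():
            p.requires_grad = False
        # Assumed router design: one linear layer producing a scalar gate per sequence.
        self.router = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size)
        gate = torch.sigmoid(self.router(hidden_states.mean(dim=1)))  # (batch, 1)
        if self.training:
            # Soft gating keeps the router differentiable while it is being tuned.
            return hidden_states + gate.unsqueeze(1) * self.attention(hidden_states)
        # At inference, skip the attention computation when the gate is low;
        # skipping is where the computational and memory savings come from.
        if (gate < 0.5).all():
            return hidden_states
        return hidden_states + self.attention(hidden_states)


if __name__ == "__main__":
    # Stand-in attention block mapping (batch, seq, hidden) -> (batch, seq, hidden).
    class ToyAttention(nn.Module):
        def __init__(self, hidden_size: int):
            super().__init__()
            self.mha = nn.MultiheadAttention(hidden_size, num_heads=4, batch_first=True)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            out, _ = self.mha(x, x, x)
            return out

    layer = MindSkipAttention(ToyAttention(64), hidden_size=64)
    x = torch.randn(2, 10, 64)
    print(layer(x).shape)  # torch.Size([2, 10, 64])

In a full model, one such wrapper would sit around each attention sub-layer, and only the routers' parameters would be passed to the optimizer, which is what keeps the tuning cost small relative to full fine-tuning.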
