What Matters in Transformers? Not All Attention is Needed
June 22, 2024
Authors: Shwai He, Guoheng Sun, Zheyu Shen, Ang Li
cs.AI
Abstract
While scaling Transformer-based large language models (LLMs) has demonstrated
promising performance across various tasks, it also introduces redundant
architectures, posing efficiency challenges for real-world deployment. Despite
some recognition of redundancy in LLMs, the variability of redundancy across
different architectures in transformers, such as MLP and Attention layers, is
under-explored. In this work, we investigate redundancy across different
modules within Transformers, including Blocks, MLP, and Attention layers, using
a similarity-based metric. Surprisingly, despite the critical role of attention
layers in distinguishing transformers from other architectures, we found that a
large portion of these layers exhibit excessively high similarity and can be
pruned without degrading performance. For instance, Llama-2-70B achieved a
48.4% speedup with only a 2.4% performance drop by pruning half of the
attention layers. Furthermore, by tracing model checkpoints throughout the
training process, we observed that attention layer redundancy is inherent and
consistent across training stages. Additionally, we propose a method that
jointly drops Attention and MLP layers, allowing us to more aggressively
drop additional layers. For instance, when dropping 31 layers (Attention +
MLP), Llama-2-13B still retains 90% of the performance on the MMLU task. Our
work provides valuable insights for future network architecture design. The
code is released at: https://github.com/Shwai-He/LLM-Drop.
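
To make the similarity-based criterion described above concrete, the sketch below scores each sub-layer by how little it changes its input hidden states and ranks layers as candidates for dropping. This is a minimal illustration assuming standard PyTorch tensors and cosine similarity; the function names layer_importance and rank_layers_by_redundancy are hypothetical and do not come from the LLM-Drop repository.

# Minimal sketch (not the authors' code) of a similarity-based redundancy
# score: a layer whose output is nearly identical to its input (cosine
# similarity close to 1) contributes little and is a candidate for dropping.
import torch

@torch.no_grad()
def layer_importance(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> float:
    # hidden_in / hidden_out: (batch, seq_len, hidden_dim) activations
    # captured before and after one Attention or MLP sub-layer.
    sim = torch.nn.functional.cosine_similarity(
        hidden_in.flatten(1).float(), hidden_out.flatten(1).float(), dim=-1
    )
    # Lower score = more redundant (output barely differs from input).
    return float(1.0 - sim.mean())

def rank_layers_by_redundancy(per_layer_io):
    # per_layer_io: list of (hidden_in, hidden_out) pairs, one per layer.
    # Returns layer indices ordered from most to least redundant.
    scores = [layer_importance(x, y) for x, y in per_layer_io]
    return sorted(range(len(scores)), key=lambda i: scores[i])

In practice, the (input, output) pairs would be collected with forward hooks on each Attention or MLP sub-layer over a small calibration set, and the lowest-scoring layers would be the first to drop; the exact metric and dropping procedure used in the paper are detailed in the released code.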