What Matters in Transformers? Not All Attention is Needed

June 22, 2024
Authors: Shwai He, Guoheng Sun, Zheyu Shen, Ang Li
cs.AI

Abstract

While scaling Transformer-based large language models (LLMs) has demonstrated promising performance across various tasks, it also introduces redundant architectures, posing efficiency challenges for real-world deployment. Despite some recognition of redundancy in LLMs, the variability of redundancy across different architectures in transformers, such as MLP and Attention layers, is under-explored. In this work, we investigate redundancy across different modules within Transformers, including Blocks, MLP, and Attention layers, using a similarity-based metric. Surprisingly, despite the critical role of attention layers in distinguishing transformers from other architectures, we find that a large portion of these layers exhibit excessively high similarity and can be pruned without degrading performance. For instance, Llama-2-70B achieved a 48.4% speedup with only a 2.4% performance drop by pruning half of the attention layers. Furthermore, by tracing model checkpoints throughout the training process, we observe that attention-layer redundancy is inherent and consistent across training stages. We further propose a method that jointly drops Attention and MLP layers, allowing us to drop additional layers more aggressively. For instance, when dropping 31 layers (Attention + MLP), Llama-2-13B still retains 90% of its performance on the MMLU task. Our work provides valuable insights for future network architecture design. The code is released at: https://github.com/Shwai-He/LLM-Drop.
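
To make the "similarity-based metric" concrete, below is a minimal Python sketch of one plausible reading of the idea: score a module (an Attention or MLP layer) by the cosine similarity between its input and output hidden states, then rank the most redundant layers for dropping. The specific function names, the use of cosine similarity over flattened token representations, and the example scores are illustrative assumptions, not the paper's exact implementation (see the released LLM-Drop code for that).

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def module_redundancy(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> float:
    """Redundancy score for one module (Attention or MLP), assumed here to be the
    mean token-wise cosine similarity between the module's input and output.
    A score close to 1.0 means the module barely changes its input, so it is a
    candidate for dropping."""
    sim = F.cosine_similarity(
        hidden_in.reshape(-1, hidden_in.shape[-1]).float(),
        hidden_out.reshape(-1, hidden_out.shape[-1]).float(),
        dim=-1,
    )
    return sim.mean().item()


def rank_layers_for_dropping(scores: dict[int, float], num_to_drop: int) -> list[int]:
    """Return indices of the most redundant layers (highest similarity first)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:num_to_drop]


if __name__ == "__main__":
    # Hypothetical scores: in practice they would be collected by hooking each
    # Attention/MLP module of a model such as Llama-2 on a small calibration set.
    fake_scores = {0: 0.72, 1: 0.98, 2: 0.95, 3: 0.60}
    print(rank_layers_for_dropping(fake_scores, num_to_drop=2))  # -> [1, 2]
```

In this reading, "jointly dropping Attention and MLP layers" would simply mean scoring both module types with the same metric and selecting from the combined pool, rather than restricting the drop budget to one module type.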
