What Matters in Transformers? Not All Attention is Needed

Abstract

While scaling Transformer-based large language models (LLMs) has demonstrated promising performance across various tasks, it also introduces redundant architectures, posing efficiency challenges for real-world deployment. Although some redundancy in LLMs is recognized, how redundancy varies across different modules within Transformers, such as MLP and Attention layers, is under-explored. In this work, we investigate redundancy across different modules within Transformers, including Blocks, MLP layers, and Attention layers, using a similarity-based metric. Surprisingly, despite the critical role of attention layers in distinguishing Transformers from other architectures, we found that a large portion of these layers exhibits excessively high similarity and can be pruned without degrading performance. For instance, Llama-2-70B achieved a 48.4% speedup with only a 2.4% performance drop when half of its attention layers were pruned. Furthermore, by tracing model checkpoints throughout the training process, we observed that attention layer redundancy is inherent and consistent across training stages. We further propose a method that jointly drops Attention and MLP layers, allowing additional layers to be dropped more aggressively. For instance, when dropping 31 layers (Attention + MLP), Llama-2-13B still retains 90% of its performance on the MMLU task. Our work provides valuable insights for future network architecture design. The code is released at: https://github.com/Shwai-He/LLM-Drop.
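
The abstract states only that redundancy is measured with a similarity-based metric and that highly similar layers can be pruned. The sketch below is one plausible reading of that idea, not the authors' implementation (which lives in the linked repository). It assumes cosine similarity between a module's input and output hidden states, defines importance as one minus the mean similarity, and drops the modules with the lowest importance. All function names, module names, and toy values are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat a zero vector as dissimilar
    return dot / (norm_a * norm_b)


def importance_score(inputs: List[Vector], outputs: List[Vector]) -> float:
    """Assumed metric: 1 - mean cosine similarity of per-token input/output states."""
    sims = [cosine_similarity(x, y) for x, y in zip(inputs, outputs)]
    return 1.0 - sum(sims) / len(sims)


def select_modules_to_drop(scores: Dict[str, float], num_to_drop: int) -> List[str]:
    """Return the names of the num_to_drop modules with the lowest importance."""
    ranked = sorted(scores, key=lambda name: scores[name])
    return ranked[:num_to_drop]


# Toy hidden states: two tokens, three dimensions (hypothetical values).
hidden_in = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
scores = {
    # Output orthogonal to input: the module changes the states a lot.
    "attn.0": importance_score(hidden_in, [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]),
    # Output parallel to input: the module barely changes the direction.
    "attn.1": importance_score(hidden_in, [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]),
    # Output partially aligned with input.
    "mlp.0": importance_score(hidden_in, [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]),
}
print(select_modules_to_drop(scores, 1))  # prints ['attn.1']
```

In this toy setup the module whose output points in the same direction as its input receives an importance of zero and is selected first, which mirrors the abstract's claim that layers with excessively high similarity are the ones that can be pruned. In a real model the hidden states would have to be collected on calibration data, a detail the abstract does not specify.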
