TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention

Abstract

Large language models (LLMs) have driven significant advances across diverse NLP tasks, with long-context models gaining prominence for handling extended inputs. However, the growing key-value (KV) cache required by Transformer architectures intensifies memory constraints, particularly during the decoding phase, creating a significant bottleneck. Existing sparse attention mechanisms designed to address this bottleneck have two limitations: (1) they often fail to reliably identify the most relevant tokens for attention, and (2) they overlook the spatial coherence of token selection across consecutive Transformer layers, which can lead to performance degradation and substantial overhead in token selection. This paper introduces TidalDecode, a simple yet effective algorithm and system for fast and accurate LLM decoding through position persistent sparse attention. TidalDecode leverages the spatial coherence of tokens selected by existing sparse attention methods and introduces a few token selection layers that perform full attention to identify the tokens with the highest attention scores, while all other layers perform sparse attention with the pre-selected tokens. This design enables TidalDecode to substantially reduce the overhead of token selection for sparse attention without sacrificing the quality of the generated results. Evaluation on a diverse set of LLMs and tasks shows that TidalDecode closely matches the generative performance of full attention methods while reducing LLM decoding latency by up to 2.1×.
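The layer scheme described in the abstract can be illustrated with a minimal, single-head sketch of one decoding step. Everything here is an assumption made for illustration rather than the authors' implementation: the function names (`attend`, `decode_step`), the use of independent random per-layer queries and caches, the choice of selection layers {0, 3}, and the rule that any layer preceding the first selection layer falls back to full attention.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, positions):
    """Scaled dot-product attention for one query, restricted to `positions`."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, keys[p])) for p in positions]
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * values[p][d] for w, p in zip(weights, positions)) for d in range(dim)]
    return out, scores


def decode_step(queries, kv_cache, selection_layers, top_k):
    """One decoding step across all layers (hypothetical sketch).

    Selection layers run full attention and keep the top_k highest-scoring
    token positions; later layers reuse those positions for sparse attention.
    """
    selected = None
    outputs, attended = [], []
    for layer, (keys, values) in enumerate(kv_cache):
        all_positions = list(range(len(keys)))
        if layer in selection_layers:
            # Full attention, then (re)select the top_k positions by score.
            out, scores = attend(queries[layer], keys, values, all_positions)
            ranked = sorted(all_positions, key=lambda p: scores[p], reverse=True)
            selected = sorted(ranked[:top_k])
            attended.append(len(all_positions))
        elif selected is None:
            # Assumption: layers before the first selection layer use full attention.
            out, _ = attend(queries[layer], keys, values, all_positions)
            attended.append(len(all_positions))
        else:
            # Sparse attention over the positions chosen by the last selection layer.
            out, _ = attend(queries[layer], keys, values, selected)
            attended.append(len(selected))
        outputs.append(out)
    return outputs, attended


# Toy data: 6 layers, 32 cached tokens, head dimension 8 (all values are random).
random.seed(0)
num_layers, num_tokens, dim = 6, 32, 8


def rand_vec():
    return [random.gauss(0.0, 1.0) for _ in range(dim)]


kv_cache = [
    ([rand_vec() for _ in range(num_tokens)], [rand_vec() for _ in range(num_tokens)])
    for _ in range(num_layers)
]
queries = [rand_vec() for _ in range(num_layers)]

outputs, attended = decode_step(queries, kv_cache, selection_layers={0, 3}, top_k=4)
print(attended)  # number of KV entries each layer attended to
```

In this toy setup only the two selection layers touch the full cache, while the remaining four layers read just `top_k` entries each. That is the source of the reduced token-selection overhead the abstract describes; in a real model each layer's query would depend on the previous layer's output instead of being sampled independently.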
