TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention

Abstract

Large language models (LLMs) have driven significant advances across diverse natural language processing (NLP) tasks, with long-context models gaining prominence for handling extended inputs. However, the growing key-value (KV) cache required by Transformer architectures intensifies memory constraints, particularly during the decoding phase, creating a significant bottleneck. Existing sparse attention mechanisms designed to address this bottleneck have two limitations: (1) they often fail to reliably identify the most relevant tokens for attention, and (2) they overlook the spatial coherence of token selection across consecutive Transformer layers, which can lead to performance degradation and substantial token selection overhead. This paper introduces TidalDecode, a simple yet effective algorithm and system for fast and accurate LLM decoding through position persistent sparse attention. TidalDecode leverages the spatial coherence of tokens selected by existing sparse attention methods and introduces a few token selection layers that perform full attention to identify the tokens with the highest attention scores, while all other layers perform sparse attention over the pre-selected tokens. This design enables TidalDecode to substantially reduce the token selection overhead of sparse attention without sacrificing the quality of the generated results. Evaluation on a diverse set of LLMs and tasks shows that TidalDecode closely matches the generative performance of full attention methods while reducing LLM decoding latency by up to 2.1x.
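The layer-wise scheme described in the abstract can be illustrated with a minimal pure-Python sketch of a single decoding step. It is an illustration under stated assumptions, not the authors' implementation: the names `attend`, `top_k_positions`, and `decode_step`, and the parameters `selection_layers` and `k`, are hypothetical; attention is single-head; and each layer's query is supplied directly rather than computed from the previous layer's output. Layers listed in `selection_layers` run full attention and record the positions of the `k` highest-scoring tokens, and later layers attend only to those positions until the next selection layer.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values, positions):
    """Scaled dot-product attention for one query, restricted to `positions`.

    Returns the output vector and the attention weights (aligned with `positions`).
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * x for q, x in zip(query, keys[p])) for p in positions]
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * values[p][d] for w, p in zip(weights, positions)) for d in range(dim)]
    return out, weights

def top_k_positions(weights, positions, k):
    """Positions of the k largest attention weights, returned in ascending order."""
    ranked = sorted(zip(weights, positions), reverse=True)
    return sorted(p for _, p in ranked[:k])

def decode_step(queries, kv_cache, selection_layers, k):
    """Run one decoding step across all layers.

    queries[l] is the (assumed given) query of layer l; kv_cache[l] is a
    (keys, values) pair. Returns per-layer outputs and the number of cached
    tokens each layer attended to.
    """
    selected = None  # token positions chosen by the most recent selection layer
    outputs, touched = [], []
    for layer, query in enumerate(queries):
        keys, values = kv_cache[layer]
        every = list(range(len(keys)))
        if layer in selection_layers:
            # Selection layer: full attention, then keep the top-k positions.
            out, weights = attend(query, keys, values, every)
            selected = top_k_positions(weights, every, k)
            touched.append(len(every))
        elif selected is None:
            # No selection has happened yet: plain full attention.
            out, _ = attend(query, keys, values, every)
            touched.append(len(every))
        else:
            # Sparse layer: reuse the positions picked by the selection layer.
            out, _ = attend(query, keys, values, selected)
            touched.append(len(selected))
        outputs.append(out)
    return outputs, touched

# Small usage example with made-up numbers (4 layers, 6 cached tokens, dim 2).
cache = [([[float((l + t) % 3), float(t % 2)] for t in range(6)],
          [[float(t), float(l)] for t in range(6)]) for l in range(4)]
qs = [[1.0, float(l)] for l in range(4)]
_, tokens_per_layer = decode_step(qs, cache, selection_layers={1}, k=2)
print(tokens_per_layer)
```

In this sketch the cost saving comes from the sparse layers touching only `k` cached tokens instead of the whole cache, while the selection cost is paid only at the few layers in `selection_layers`. Which layers to use for selection and how large `k` should be are left as free parameters here.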


