Efficient LLM inference solution on Intel GPU

Abstract

Transformer-based Large Language Models (LLMs) are widely used in many fields, and the efficiency of LLM inference has become a pressing concern in real applications. However, LLMs usually have complex model structures with a massive number of operations, and they perform inference in auto-regressive mode, which makes it challenging to design a highly efficient system. In this paper, we propose an efficient LLM inference solution with low latency and high throughput. First, we simplify the LLM decoder layer by fusing data movement and element-wise operations, which reduces the memory access frequency and lowers system latency. We also propose a segment KV cache policy that keeps the keys/values of the request and response tokens in separate physical memory. This allows effective device memory management, which helps enlarge the runtime batch size and improve system throughput. A customized Scaled-Dot-Product-Attention kernel is designed to match our fusion policy based on the segment KV cache solution. We implement our LLM inference solution on Intel GPU and make it publicly available. Compared with the standard HuggingFace implementation, the proposed solution achieves up to 7x lower token latency and 27x higher throughput for some popular LLMs on Intel GPU.
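
To make the memory argument behind the segment KV cache more concrete, the following is a minimal sketch of the accounting only, not the implementation described in the paper. It assumes a decoding setup with several candidate sequences per request (for example, beam search), a baseline that stores one contiguous cache per candidate, and a segmented layout that stores the request (prompt) keys/values once and only the response keys/values per candidate. The abstract does not state these details; `KVShape`, `contiguous_cache_bytes`, and `segmented_cache_bytes` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class KVShape:
    """Hypothetical description of a model's per-token KV footprint."""
    layers: int
    heads: int
    head_dim: int
    bytes_per_elem: int = 2  # assumption: 16-bit elements

    def bytes_per_token(self) -> int:
        # Factor 2 covers one key vector and one value vector per layer.
        return 2 * self.layers * self.heads * self.head_dim * self.bytes_per_elem


def contiguous_cache_bytes(shape: KVShape, batch: int, beams: int,
                           prompt_len: int, max_new_tokens: int) -> int:
    # Assumed baseline: every candidate holds its own copy of
    # prompt + response keys/values in one contiguous buffer.
    return batch * beams * (prompt_len + max_new_tokens) * shape.bytes_per_token()


def segmented_cache_bytes(shape: KVShape, batch: int, beams: int,
                          prompt_len: int, max_new_tokens: int) -> int:
    # Assumed segmented layout: prompt keys/values stored once per request,
    # response keys/values stored per candidate in a separate buffer.
    prompt = batch * prompt_len * shape.bytes_per_token()
    response = batch * beams * max_new_tokens * shape.bytes_per_token()
    return prompt + response


if __name__ == "__main__":
    shape = KVShape(layers=2, heads=4, head_dim=8)
    args = dict(batch=1, beams=4, prompt_len=10, max_new_tokens=5)
    print(contiguous_cache_bytes(shape, **args))
    print(segmented_cache_bytes(shape, **args))
```

Under these assumptions the two layouts cost the same when there is a single candidate per request, and the segmented layout saves more as the prompt grows relative to the response. Memory freed this way is what could be spent on a larger runtime batch.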
