Efficient LLM inference solution on Intel GPU

Abstract

Transformer-based Large Language Models (LLMs) are widely used across many fields, and the efficiency of LLM inference has become a hot topic in real applications. However, LLMs usually have complex model structures with a massive number of operations and perform inference in auto-regressive mode, which makes designing a highly efficient system challenging. In this paper, we propose an efficient LLM inference solution with low latency and high throughput. First, we simplify the LLM decoder layer by fusing data movement and element-wise operations, which reduces memory access frequency and lowers system latency. We also propose a segment KV cache policy that keeps the keys/values of the request and response tokens in separate physical memory for effective device memory management, which helps enlarge the runtime batch size and improve system throughput. A customized Scaled-Dot-Product-Attention kernel is designed to match our fusion policy based on the segment KV cache solution. We implement our LLM inference solution on Intel GPU and release it publicly. Compared with the standard HuggingFace implementation, the proposed solution achieves up to 7x lower token latency and 27x higher throughput for some popular LLMs on Intel GPU.
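To make the segment KV cache idea more concrete, the following is a minimal toy sketch in plain Python, not the kernel described in the paper. It assumes only what the abstract states: the keys/values of request (prompt) tokens and of response tokens live in separate buffers, and attention reads both segments without first merging them. The names `SegmentKVCache` and `attend` are hypothetical and chosen for illustration only; an actual implementation would operate on device memory in a fused GPU kernel.

```python
import math


class SegmentKVCache:
    """Toy KV cache with separate prompt and response segments (hypothetical)."""

    def __init__(self, prompt_keys, prompt_values):
        # Prompt segment: filled once after the prompt is processed.
        self.prompt_keys = list(prompt_keys)
        self.prompt_values = list(prompt_values)
        # Response segment: grows by one entry per generated token.
        self.response_keys = []
        self.response_values = []

    def append(self, key, value):
        """Store the key/value of a newly generated token."""
        self.response_keys.append(key)
        self.response_values.append(value)

    def segments(self):
        """Yield each (keys, values) segment without merging the buffers."""
        yield self.prompt_keys, self.prompt_values
        yield self.response_keys, self.response_values


def attend(query, cache):
    """Scaled dot-product attention for one query vector over both segments."""
    scale = 1.0 / math.sqrt(len(query))
    scores, values = [], []
    for keys, vals in cache.segments():
        for k, v in zip(keys, vals):
            scores.append(scale * sum(q * x for q, x in zip(query, k)))
            values.append(v)
    # Numerically stable softmax over all cached positions.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


# Example: two prompt tokens, then one generated token.
cache = SegmentKVCache(prompt_keys=[[1.0, 0.0], [0.0, 1.0]],
                       prompt_values=[[1.0, 0.0], [0.0, 1.0]])
cache.append(key=[1.0, 0.0], value=[2.0, 2.0])
output = attend([0.0, 0.0], cache)
```

With a zero query every cached position receives the same weight, so `output` is the mean of the three stored values. Keeping the prompt segment fixed while only the response segment grows is one plausible reading of how separate buffers simplify memory management; the abstract does not spell out the allocation details.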
