Efficient LLM inference solution on Intel GPU

Abstract

Transformer-based large language models (LLMs) have been widely used in many fields, and the efficiency of LLM inference has become a hot topic in real applications. However, LLMs usually have complex model structures with a massive number of operations and perform inference in auto-regressive mode, which makes designing a highly efficient system a challenging task. In this paper, we propose an efficient LLM inference solution with low latency and high throughput. First, we simplify the LLM decoder layer by fusing data movement and element-wise operations to reduce memory access frequency and lower system latency. We also propose a segment KV cache policy that keeps the keys/values of the request and response tokens in separate physical memory for effective device memory management, which helps enlarge the runtime batch size and improve system throughput. A customized Scaled-Dot-Product-Attention kernel is designed to match our fusion policy based on the segment KV cache solution. We implement our LLM inference solution on Intel GPU and release it publicly. Compared with the standard HuggingFace implementation, the proposed solution achieves up to 7x lower token latency and 27x higher throughput for some popular LLMs on Intel GPU.
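To make the segment KV cache idea more concrete, the following is a minimal CPU-only sketch in plain Python and is not the released Intel GPU implementation. The class `SegmentKVCache` and the function `segment_attention` are hypothetical names introduced only for illustration. The sketch assumes a single query vector and a non-empty prompt. It keeps prompt (request) and response keys/values in separate buffers and computes scaled dot-product attention across both segments without concatenating them.

```python
import math


class SegmentKVCache:
    """Toy cache (hypothetical): prompt and response K/V live in separate buffers."""

    def __init__(self, prompt_keys, prompt_values):
        # Prompt segment: filled once after the prompt pass, then only read.
        self.prompt_keys = list(prompt_keys)
        self.prompt_values = list(prompt_values)
        # Response segment: grows by one entry per generated token.
        self.response_keys = []
        self.response_values = []

    def append(self, key, value):
        self.response_keys.append(key)
        self.response_values.append(value)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def segment_attention(query, cache):
    """Single-query scaled dot-product attention over both segments."""
    scale = 1.0 / math.sqrt(len(query))
    segments = [
        (cache.prompt_keys, cache.prompt_values),
        (cache.response_keys, cache.response_values),
    ]
    # Scores are computed per segment; the two buffers are never merged.
    scores = [[dot(query, k) * scale for k in keys] for keys, _ in segments]
    peak = max(s for seg in scores for s in seg)  # subtract max for stability
    weights = [[math.exp(s - peak) for s in seg] for seg in scores]
    total = sum(w for seg in weights for w in seg)  # softmax spans both segments
    out = [0.0] * len(cache.prompt_values[0])
    for seg_weights, (_, values) in zip(weights, segments):
        for w, v in zip(seg_weights, values):
            for i, x in enumerate(v):
                out[i] += (w / total) * x
    return out


# Usage: two prompt tokens, one generated token, one query vector.
cache = SegmentKVCache([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
cache.append([1.0, 1.0], [5.0, 6.0])
output = segment_attention([0.5, -0.5], cache)
```

Because the softmax normalizer is accumulated over both segments, the result matches attention over a single concatenated cache, while the prompt buffer is never copied or resized as the response grows.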
