Inference Performance Optimization for Large Language Models on CPUs

Abstract

Large language models (LLMs) have shown exceptional performance and vast potential across diverse tasks. Deploying high-performance LLMs in low-resource environments has garnered significant attention in the industry. When GPU hardware resources are limited, CPUs offer an alternative. To ease the financial burden and the constraints imposed by hardware resources, optimizing inference performance is necessary. In this paper, we introduce an easily deployable inference performance optimization solution aimed at accelerating LLMs on CPUs. In this solution, we implement an effective way to reduce the KV cache size while maintaining precision. We propose a distributed inference optimization approach and implement it on top of the oneAPI Collective Communications Library. Furthermore, we propose optimization approaches for LLMs on CPUs and conduct tailored optimizations for the most commonly used models. The code is open-sourced at https://github.com/intel/xFasterTransformer.
