

S-LoRA: Serving Thousands of Concurrent LoRA Adapters

November 6, 2023
Authors: Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, Ion Stoica
cs.AI

Abstract

The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving. To capitalize on these opportunities, we present S-LoRA, a system designed for the scalable serving of many LoRA adapters. S-LoRA stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. To efficiently use the GPU memory and reduce fragmentation, S-LoRA proposes Unified Paging. Unified Paging uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), S-LoRA can improve the throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, S-LoRA enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services.