
S-LoRA: Serving Thousands of Concurrent LoRA Adapters

November 6, 2023
Authors: Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, Ion Stoica
cs.AI

Abstract

The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving. To capitalize on these opportunities, we present S-LoRA, a system designed for the scalable serving of many LoRA adapters. S-LoRA stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. To efficiently use the GPU memory and reduce fragmentation, S-LoRA proposes Unified Paging. Unified Paging uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), S-LoRA can improve the throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, S-LoRA enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services.