TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices
October 1, 2024
Authors: Zonghang Li, Wenjiao Feng, Mohsen Guizani, Hongfang Yu
cs.AI
Abstract
Large model inference is shifting from cloud to edge due to concerns about
the privacy of user interaction data. However, edge devices often struggle with
limited computing power, memory, and bandwidth, requiring collaboration across
multiple devices to run and speed up LLM inference. Pipeline parallelism, the
mainstream solution, is inefficient for single-user scenarios, while tensor
parallelism struggles with frequent communications. In this paper, we argue
that tensor parallelism can be more effective than pipeline parallelism on low-resource
devices, and present a compute- and memory-efficient tensor parallel inference
system, named TPI-LLM, to serve 70B-scale models. TPI-LLM keeps sensitive raw
data local on users' devices and introduces a sliding window memory
scheduler to dynamically manage layer weights during inference, with disk I/O
latency overlapped with the computation and communication. This allows larger
models to run smoothly on memory-limited devices. We analyze the communication
bottleneck and find that link latency, not bandwidth, emerges as the main
issue, so a star-based allreduce algorithm is implemented. Through extensive
experiments on both emulated and real testbeds, TPI-LLM demonstrated over 80%
less time-to-first-token and token latency compared to Accelerate, and over 90%
compared to Transformers and Galaxy, while cutting the peak memory footprint of
Llama 2-70B by 90%, requiring only 3.1 GB of memory for 70B-scale models.
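The sliding window memory scheduler is only described at a high level in the abstract, so the following is a minimal sketch of the idea rather than the authors' implementation. The names `load_layer_weights`, `num_layers`, and `window_size` are hypothetical, and a single background thread is assumed for prefetching so that the disk read of the next layer overlaps with the current layer's computation and communication.

```python
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor


class SlidingWindowScheduler:
    """Keep at most `window_size` layers resident and prefetch the next one
    on a background thread while the current layer computes (sketch only)."""

    def __init__(self, num_layers, window_size, load_layer_weights):
        self.num_layers = num_layers
        self.window_size = window_size
        self.load = load_layer_weights      # hypothetical blocking disk read of one layer
        self.cache = OrderedDict()          # layer_id -> weights, in insertion order
        self.pool = ThreadPoolExecutor(max_workers=1)
        self.pending = {}                   # layer_id -> Future of an in-flight read

    def _prefetch(self, layer_id):
        if 0 <= layer_id < self.num_layers and layer_id not in self.cache \
                and layer_id not in self.pending:
            self.pending[layer_id] = self.pool.submit(self.load, layer_id)

    def get(self, layer_id):
        # Start the asynchronous read of the next layer before blocking, so its
        # disk I/O overlaps with this layer's compute and communication.
        self._prefetch(layer_id + 1)
        if layer_id in self.pending:
            self.cache[layer_id] = self.pending.pop(layer_id).result()
        elif layer_id not in self.cache:
            self.cache[layer_id] = self.load(layer_id)
        # Evict the oldest resident layer once the window is exceeded.
        while len(self.cache) > self.window_size:
            self.cache.popitem(last=False)
        return self.cache[layer_id]
```

Under this scheme, peak weight memory is bounded by the window size rather than by the total model size, which is consistent with the abstract's claim that a 70B-scale model can run within a few gigabytes of RAM.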
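Likewise, the star-based allreduce can be sketched as below, assuming generic point-to-point primitives `send(rank, tensor)` and `recv(rank)` (both hypothetical here, standing in for whatever message-passing backend is used). Every worker pushes its partial result to a central rank, which sums the contributions and broadcasts the total back, so each synchronization pays only two link-latency hops; this matches the abstract's observation that link latency, not bandwidth, is the main bottleneck on edge networks.

```python
import numpy as np


def star_allreduce(my_rank, world_size, local_tensor, send, recv, center=0):
    """Sum `local_tensor` across all ranks via a hub-and-spoke (star) pattern."""
    if my_rank == center:
        total = np.array(local_tensor, copy=True)
        for rank in range(world_size):
            if rank != center:
                total += recv(rank)          # gather a partial sum from each worker
        for rank in range(world_size):
            if rank != center:
                send(rank, total)            # broadcast the reduced tensor back
        return total
    send(center, local_tensor)               # push the local partial result to the hub
    return recv(center)                      # wait for the fully reduced tensor
```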