

Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models

February 10, 2024
作者: Keisuke Kamahori, Yile Gu, Kan Zhu, Baris Kasikci
cs.AI

Abstract

Large Language Models (LLMs) based on the Mixture-of-Experts (MoE) architecture show promising performance on various tasks. However, running them in resource-constrained settings, where GPU memory is not abundant, is challenging due to their huge model sizes. Existing systems that offload model weights to CPU memory suffer significant overhead from frequently moving data between the CPU and GPU. In this paper, we propose Fiddler, a resource-efficient inference engine with CPU-GPU orchestration for MoE models. The key idea of Fiddler is to use the computation ability of the CPU to minimize data movement between the CPU and GPU. Our evaluation shows that Fiddler can run the uncompressed Mixtral-8x7B model, whose parameters exceed 90GB, generating over 3 tokens per second on a single GPU with 24GB of memory, an order-of-magnitude improvement over existing methods. The code of Fiddler is publicly available at https://github.com/efeslab/fiddler
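The abstract's key idea, using CPU computation to avoid moving expert weights, can be made concrete with a back-of-the-envelope sketch. The code below is illustrative only and is not Fiddler's actual implementation: it runs one MoE expert (a two-layer feed-forward block) on the CPU and compares how many bytes must cross the CPU-GPU link under the two strategies, shipping the small activation to the CPU versus copying the large expert weights to the GPU. The dimensions are shrunk for brevity; Mixtral-8x7B's real expert dimensions (roughly 4096 hidden, 14336 intermediate) make the gap far larger.

```python
import numpy as np

def expert_ffn_cpu(x, w1, w2):
    """Run one MoE expert (two-layer FFN with ReLU) entirely on the CPU."""
    return np.maximum(x @ w1, 0.0) @ w2

def link_bytes(x, w1, w2, run_on_cpu):
    """Bytes crossing the CPU-GPU link per strategy (float32 = 4 bytes)."""
    if run_on_cpu:
        # Only the activation travels: GPU -> CPU, then the result back.
        return 2 * x.size * 4
    # Weight-offloading baseline: the full expert weights are copied to the GPU.
    return (w1.size + w2.size) * 4

# Toy dimensions; real Mixtral-8x7B experts use d_model=4096, d_ff=14336.
d_model, d_ff = 64, 256
rng = np.random.default_rng(0)
x = rng.standard_normal((1, d_model)).astype(np.float32)   # one decode token
w1 = rng.standard_normal((d_model, d_ff)).astype(np.float32)
w2 = rng.standard_normal((d_ff, d_model)).astype(np.float32)

y = expert_ffn_cpu(x, w1, w2)
cpu_bytes = link_bytes(x, w1, w2, run_on_cpu=True)    # 2 * 64 * 4 = 512 bytes
gpu_bytes = link_bytes(x, w1, w2, run_on_cpu=False)   # 2 * 64 * 256 * 4 = 131072 bytes
```

During single-token decoding the activation is a single row vector, so executing the expert on the CPU moves orders of magnitude less data than transferring its weight matrices, which is the trade-off Fiddler exploits.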
