
LLM in a flash: Efficient Large Language Model Inference with Limited Memory

December 12, 2023
作者: Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar
cs.AI

Abstract

Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their intensive computational and memory requirements present challenges, especially for devices with limited DRAM capacity. This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters on flash memory but bringing them on demand to DRAM. Our method involves constructing an inference cost model that harmonizes with the flash memory behavior, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. Within this flash memory-informed framework, we introduce two principal techniques. First, "windowing" strategically reduces data transfer by reusing previously activated neurons, and second, "row-column bundling", tailored to the sequential data access strengths of flash memory, increases the size of data chunks read from flash memory. These methods collectively enable running models up to twice the size of the available DRAM, with a 4-5x and 20-25x increase in inference speed compared to naive loading approaches on CPU and GPU, respectively. Our integration of sparsity awareness, context-adaptive loading, and a hardware-oriented design paves the way for effective inference of LLMs on devices with limited memory.
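The "windowing" idea from the abstract can be sketched with a toy cache: keep the FFN neurons that were active for the last few tokens resident in DRAM, so only the newly activated neurons must be fetched from flash. This is a minimal illustration of the concept, not the paper's implementation; the class name, the fixed window size, and the use of plain neuron-id sets are all assumptions for the sketch.

```python
from collections import deque

class NeuronWindowCache:
    """Toy sketch of 'windowing': neurons active within the last
    `window_size` tokens stay resident in DRAM, so each new token
    only triggers flash reads for the delta of newly active neurons.
    (Hypothetical helper, not the paper's actual implementation.)"""

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()   # active-neuron sets, one per recent token
        self.resident = set()   # neuron ids currently held in DRAM

    def step(self, active_neurons):
        """Process one token; return the neuron ids to load from flash."""
        needed = set(active_neurons)
        to_load = needed - self.resident      # only the delta is transferred
        self.window.append(needed)
        self.resident |= needed
        if len(self.window) > self.window_size:
            expired = self.window.popleft()
            # evict only neurons no longer active anywhere in the window
            still_active = set().union(*self.window)
            self.resident -= expired - still_active
        return to_load
```

Because consecutive tokens tend to activate overlapping neuron sets, the returned delta shrinks quickly: a first token activating `{1, 2, 3}` loads all three, but a next token activating `{2, 3, 4}` loads only `{4}`. The row-column bundling technique would then store each up-projection row together with the matching down-projection column so that each such load is one larger contiguous flash read.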