T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge

June 25, 2024
Authors: Jianyu Wei, Shijie Cao, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, Mao Yang
cs.AI

Abstract

The deployment of Large Language Models (LLMs) on edge devices is increasingly important for enhancing on-device intelligence. Weight quantization is crucial for reducing the memory footprint of LLMs on devices. However, low-bit LLMs necessitate mixed-precision matrix multiplication (mpGEMM) of low-precision weights and high-precision activations during inference. Existing systems, lacking native support for mpGEMM, resort to dequantizing weights for high-precision computation. Such an indirect path can introduce significant inference overhead. In this paper, we introduce T-MAC, an innovative lookup-table (LUT)-based method designed for efficient low-bit LLM (i.e., weight-quantized LLM) inference on CPUs. T-MAC directly supports mpGEMM without dequantization, while simultaneously eliminating the required multiplications and reducing the additions. Specifically, T-MAC transforms traditional data-type-centric multiplication into bit-wise table lookups, enabling a unified and scalable mpGEMM solution. Our LUT-based kernels scale linearly with the weight bit-width. Evaluated on low-bit Llama and BitNet models, T-MAC demonstrates up to a 4x increase in throughput and a 70% reduction in energy consumption compared to llama.cpp. For BitNet-b1.58-3B, T-MAC delivers a token-generation throughput of 30 tokens/s with a single core and 71 tokens/s with eight cores on M2 Ultra, and 11 tokens/s on lower-end devices such as the Raspberry Pi 5, significantly exceeding the average adult reading speed. T-MAC, with its LUT-based computing paradigm, paves the way for the practical deployment of low-bit LLMs on resource-constrained edge devices without compromising computational efficiency. The system is open-sourced at https://github.com/microsoft/T-MAC.
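To make the bit-wise table-lookup idea concrete, below is a minimal NumPy sketch of LUT-based mixed-precision multiply-accumulate; it is not T-MAC's optimized CPU kernel. It computes one weight-row/activation dot product by precomputing, for each group of g activations, the partial sums for all 2^g one-bit weight patterns, then indexing that shared table with each bit plane of the quantized weights. The function name lut_mpgemv, the group size g=4, and the unsigned bit-plane decomposition are illustrative assumptions; the real kernels use SIMD shuffle-based tables over whole matrices.

```python
import numpy as np

def lut_mpgemv(weight_planes, activations, bit_width, g=4):
    """Sketch: one output of a LUT-based mixed-precision GEMV.

    weight_planes: (bit_width, num_groups, g) array of 0/1 bit planes,
        least-significant plane first (unsigned quantization assumed).
    activations:   (num_groups * g,) high-precision activation vector.
    """
    num_groups = activations.size // g
    acts = activations.reshape(num_groups, g)

    # Precompute, per activation group, the partial sums for all 2**g
    # one-bit weight patterns. The table is shared across every bit
    # plane (and, in a full GEMM, every output row), which is where
    # the multiplication savings come from.
    patterns = (np.arange(2 ** g)[:, None] >> np.arange(g)) & 1  # (2**g, g)
    lut = patterns @ acts.T                                      # (2**g, num_groups)

    # Replace multiply-accumulate with table lookups: each g-bit
    # weight group packs into an index; bit planes combine by shifts.
    total = 0.0
    for b in range(bit_width):
        idx = (weight_planes[b] << np.arange(g)).sum(axis=1)     # (num_groups,)
        total += lut[idx, np.arange(num_groups)].sum() * (1 << b)
    return total

# Usage: 2-bit weights, 16 activations, groups of 4; matches w @ x.
rng = np.random.default_rng(0)
w = rng.integers(0, 4, size=16)                    # unsigned 2-bit weights
x = rng.standard_normal(16)
planes = np.stack([(w >> b) & 1 for b in range(2)]).reshape(2, 4, 4)
assert np.isclose(lut_mpgemv(planes, x, bit_width=2), w @ x)
```

In this sketch the table costs 2^g entries per activation group while each weight group costs one lookup regardless of precision, so total work grows with the number of bit planes; this mirrors the abstract's claim that the LUT-based kernels scale linearly with the weight bit-width.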

