
T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge

June 25, 2024
Authors: Jianyu Wei, Shijie Cao, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, Mao Yang
cs.AI

Abstract

The deployment of Large Language Models (LLMs) on edge devices is increasingly important for enhancing on-device intelligence. Weight quantization is crucial for reducing the memory footprint of LLMs on devices. However, low-bit LLMs necessitate mixed-precision matrix multiplication (mpGEMM) of low-precision weights and high-precision activations during inference. Existing systems, lacking native support for mpGEMM, resort to dequantizing weights for high-precision computation. Such an indirect approach can introduce significant inference overhead. In this paper, we introduce T-MAC, an innovative lookup-table (LUT)-based method designed for efficient low-bit LLM (i.e., weight-quantized LLM) inference on CPUs. T-MAC directly supports mpGEMM without dequantization, while simultaneously eliminating the multiplications and reducing the additions required. Specifically, T-MAC transforms traditional data-type-centric multiplication into bit-wise table lookup, and enables a unified and scalable mpGEMM solution. Our LUT-based kernels scale linearly with the weight bit-width. Evaluated on low-bit Llama and BitNet models, T-MAC demonstrates up to a 4x increase in throughput and a 70% reduction in energy consumption compared to llama.cpp. For BitNet-b1.58-3B, T-MAC delivers a token-generation throughput of 30 tokens/s with a single core and 71 tokens/s with eight cores on M2-Ultra, and 11 tokens/s on lower-end devices like the Raspberry Pi 5, which significantly exceeds the average adult reading speed. T-MAC, with its LUT-based computing paradigm, paves the way for the practical deployment of low-bit LLMs on resource-constrained edge devices without compromising computational efficiency. The system is open-sourced at https://github.com/microsoft/T-MAC.
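To make the bit-wise table-lookup idea concrete, below is a minimal Python/NumPy sketch of an LUT-based mixed-precision GEMV, written from the abstract's description rather than from T-MAC's actual kernels (which are vectorized CPU code; see the repository). It assumes unsigned quantized weights decomposed into bit planes and an illustrative group size of g=4 activations per table: each group's 2^g partial sums are precomputed once, after which every group of g weight bits costs one table lookup instead of g multiply-accumulates, and per-plane results are shifted and summed, so the cost scales linearly with the weight bit-width.

```python
import numpy as np

def build_luts(act, g=4):
    """Precompute, for every group of g activations, all 2**g partial
    sums: table entry p holds sum(act[j] for j where bit j of p is 1)."""
    num_groups = len(act) // g
    luts = np.zeros((num_groups, 1 << g), dtype=np.float32)
    for grp in range(num_groups):
        chunk = act[grp * g:(grp + 1) * g]
        for p in range(1 << g):
            luts[grp, p] = sum(chunk[j] for j in range(g) if (p >> j) & 1)
    return luts

def lut_mpgemv(w_planes, act, g=4):
    """Bit-serial mixed-precision GEMV via table lookup (a sketch of the
    idea, not T-MAC's kernel).

    w_planes: (bits, M, K) array of {0, 1}; plane i stores bit i of the
              unsigned quantized weights, so W = sum_i 2**i * w_planes[i].
    act:      (K,) high-precision activation vector (K divisible by g).
    Computes W @ act with only lookups, shifts, and adds; the work grows
    linearly with the number of weight bit planes."""
    bits, M, K = w_planes.shape
    luts = build_luts(act, g)
    out = np.zeros(M, dtype=np.float32)
    for i in range(bits):                      # one pass per weight bit plane
        for m in range(M):
            acc = 0.0
            for grp in range(K // g):
                # Pack the group's g weight bits into a single table index.
                idx = 0
                for j in range(g):
                    idx |= int(w_planes[i, m, grp * g + j]) << j
                acc += luts[grp, idx]          # one lookup replaces g multiplies
            out[m] += acc * (1 << i)           # bit plane i is worth 2**i
    return out
```

A quick check against a plain matrix-vector product, using hypothetical 4-bit weights and small shapes:

```python
rng = np.random.default_rng(0)
W = rng.integers(0, 16, size=(8, 64))               # 4-bit unsigned weights
planes = np.stack([(W >> i) & 1 for i in range(4)])  # decompose into bit planes
act = rng.standard_normal(64).astype(np.float32)
assert np.allclose(lut_mpgemv(planes, act), W @ act, atol=1e-3)
```

The per-group tables depend only on the activations, so they are built once and reused across all M output rows and all bit planes; T-MAC's real kernels additionally map these lookups onto CPU SIMD table-lookup instructions rather than scalar loops.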
