T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge

Abstract

The deployment of Large Language Models (LLMs) on edge devices is increasingly important for enhancing on-device intelligence. Weight quantization is crucial for reducing the memory footprint of LLMs on devices. However, low-bit LLMs require mixed-precision matrix multiplication (mpGEMM) of low-precision weights and high-precision activations during inference. Existing systems lack native support for mpGEMM and resort to dequantizing weights for high-precision computation. This indirect approach can lead to significant inference overhead. In this paper, we introduce T-MAC, an innovative lookup table (LUT)-based method designed for efficient low-bit LLM (i.e., weight-quantized LLM) inference on CPUs. T-MAC directly supports mpGEMM without dequantization, while simultaneously eliminating multiplications and reducing the additions required. Specifically, T-MAC transforms the traditional data-type-centric multiplication into bit-wise table lookup, enabling a unified and scalable mpGEMM solution. Our LUT-based kernels scale linearly with the weight bit-width. Evaluated on low-bit Llama and BitNet models, T-MAC demonstrates up to a 4x increase in throughput and a 70% reduction in energy consumption compared to llama.cpp. For BitNet-b1.58-3B, T-MAC delivers a token generation throughput of 30 tokens/s with a single core and 71 tokens/s with eight cores on M2-Ultra, and 11 tokens/s on lower-end devices like Raspberry Pi 5, which significantly exceeds the average adult reading speed. With its LUT-based computing paradigm, T-MAC paves the way for the practical deployment of low-bit LLMs on resource-constrained edge devices without compromising computational efficiency. The system is open-sourced at https://github.com/microsoft/T-MAC.
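To make the idea of replacing multiplication with bit-wise table lookup concrete, the following is a minimal sketch in pure Python. It is an illustration under stated assumptions, not the T-MAC kernel: the function names (`build_tables`, `lut_matvec`, `reference_matvec`), the group size `g = 4`, and the use of unsigned integer weights without scales or zero points are all choices made here for clarity. The sketch splits the weights into one-bit planes, precomputes every possible partial sum for each group of `g` activations, and then replaces each group of multiply-adds with a single table lookup. The outer loop over bit planes shows why the cost grows linearly with the weight bit-width.

```python
def build_tables(activations, g=4):
    """For each group of g activations, precompute all 2**g partial sums.

    table[idx] is the sum of the activations whose position j in the group
    has bit j set in idx. Only additions are used to fill the table.
    """
    assert len(activations) % g == 0, "sketch assumes K is a multiple of g"
    tables = []
    for start in range(0, len(activations), g):
        group = activations[start:start + g]
        table = [0.0] * (1 << g)
        for idx in range(1, 1 << g):
            low = idx & -idx               # lowest set bit of idx
            j = low.bit_length() - 1       # its position within the group
            table[idx] = table[idx ^ low] + group[j]
        tables.append(table)
    return tables


def lut_matvec(weights, activations, bits, g=4):
    """Matrix-vector product with unsigned `bits`-bit integer weights.

    Assumption for illustration: weights are plain unsigned integers.
    A real kernel would pack the lookup indices offline; here they are
    assembled on the fly to keep the sketch short.
    """
    tables = build_tables(activations, g)
    out = []
    for row in weights:
        acc = 0.0
        for i in range(bits):              # one pass per weight bit plane
            plane_sum = 0.0
            for t, table in enumerate(tables):
                idx = 0
                for j in range(g):         # gather bit i of g weights
                    idx |= ((row[t * g + j] >> i) & 1) << j
                plane_sum += table[idx]    # lookup replaces g multiply-adds
            acc += plane_sum * (1 << i)    # power-of-two scaling per plane
        out.append(acc)
    return out


def reference_matvec(weights, activations):
    """Ordinary multiply-accumulate, used only to check the sketch."""
    return [sum(w * a for w, a in zip(row, activations)) for row in weights]


weights = [[0, 1, 2, 3, 3, 2, 1, 0],
           [1, 1, 1, 1, 0, 0, 0, 0]]      # 2-bit unsigned weights
activations = [0.5, -1.0, 2.0, 0.25, 1.5, -0.5, 3.0, -2.0]

lut_result = lut_matvec(weights, activations, bits=2)
ref_result = reference_matvec(weights, activations)
```

In this sketch the tables depend only on the activations, so they are built once and shared by every weight row and every bit plane. Going from 2-bit to 4-bit weights doubles the number of lookups but leaves the tables unchanged.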
