LUT-LLM: Efficient Large Language Model Inference with Memory-based Computations on FPGAs
November 9, 2025
Authors: Zifan He, Shengyu Ye, Rui Ma, Yang Wang, Jason Cong
cs.AI
Abstract
The rapid progress of large language models (LLMs) has advanced numerous
applications, yet efficient single-batch inference remains vital for on-device
intelligence. While FPGAs offer fine-grained data control and high energy
efficiency, recent GPU optimizations have narrowed their advantage, especially
under arithmetic-based computation. To overcome this, we leverage FPGAs'
abundant on-chip memory to shift LLM inference from arithmetic- to memory-based
computation through table lookups. We present LUT-LLM, the first FPGA
accelerator enabling 1B+ LLM inference via vector-quantized memory operations.
Our analysis identifies activation-weight co-quantization as the most effective
scheme, supported by (1) bandwidth-aware parallel centroid search, (2)
efficient 2D table lookups, and (3) a spatial-temporal hybrid design minimizing
data caching. Implemented on an AMD V80 FPGA for a customized Qwen 3 1.7B
model, LUT-LLM achieves 1.66x lower latency than an AMD MI210 GPU and 1.72x
higher energy efficiency than an NVIDIA A100 GPU, and it scales to 32B models
with a 2.16x energy-efficiency gain over the A100.
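
To make the table-lookup idea concrete, below is a minimal NumPy sketch, written
for this summary rather than taken from the paper, of how a matrix-vector
product can be replaced by memory lookups once the weights are vector-quantized:
each weight row is split into length-G sub-vectors, each sub-vector is mapped
offline to one of K centroids, and at inference time a small 2D table of
activation-centroid dot products is built so the matmul reduces to gathers and
adds. All names, shapes, and the toy random-centroid quantizer are illustrative
assumptions (a real system would train the codebook, e.g. with k-means).

```python
# Minimal sketch (illustrative, not the paper's implementation) of
# table-lookup-based matrix-vector multiplication via weight vector quantization.
import numpy as np

rng = np.random.default_rng(0)
D_in, D_out, G, K = 64, 32, 4, 16          # dims, sub-vector length, codebook size
n_groups = D_in // G

W = rng.standard_normal((D_out, D_in)).astype(np.float32)
x = rng.standard_normal(D_in).astype(np.float32)

# --- Offline: vector-quantize the weights ---
# Toy quantizer: random centroids stand in for a k-means-trained codebook.
subvecs = W.reshape(D_out, n_groups, G)               # split rows into sub-vectors
centroids = rng.standard_normal((K, G)).astype(np.float32)
# Nearest-centroid assignment per sub-vector (brute-force L2 search).
dists = ((subvecs[:, :, None, :] - centroids[None, None, :, :]) ** 2).sum(-1)
codes = dists.argmin(-1)                              # (D_out, n_groups) indices

# --- Online: replace multiplies with table lookups ---
# Precompute one table row per activation slice: table[g, k] = <x_g, centroid_k>.
x_groups = x.reshape(n_groups, G)
table = x_groups @ centroids.T                        # (n_groups, K) dot products
# The matmul is now pure gather-and-add over a 2D table (group index, code index).
y_lut = table[np.arange(n_groups)[None, :], codes].sum(-1)

y_ref = W @ x                                         # exact result for comparison
print("max abs error:", np.abs(y_lut - y_ref).max())  # nonzero: quantization loss
```

On an FPGA, such a table can reside in on-chip BRAM/URAM, so each multiply-accumulate
becomes a lookup-and-add; this is the shift from arithmetic- to memory-based
computation that the abstract describes.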