
LUT-LLM: Efficient Large Language Model Inference with Memory-based Computations on FPGAs

November 9, 2025
Authors: Zifan He, Shengyu Ye, Rui Ma, Yang Wang, Jason Cong
cs.AI

Abstract

The rapid progress of large language models (LLMs) has advanced numerous applications, yet efficient single-batch inference remains vital for on-device intelligence. While FPGAs offer fine-grained data control and high energy efficiency, recent GPU optimizations have narrowed their advantage, especially under arithmetic-based computation. To overcome this, we leverage FPGAs' abundant on-chip memory to shift LLM inference from arithmetic- to memory-based computation through table lookups. We present LUT-LLM, the first FPGA accelerator enabling 1B+ LLM inference via vector-quantized memory operations. Our analysis identifies activation-weight co-quantization as the most effective scheme, supported by (1) bandwidth-aware parallel centroid search, (2) efficient 2D table lookups, and (3) a spatial-temporal hybrid design minimizing data caching. Implemented on an AMD V80 FPGA for a customized Qwen 3 1.7B model, LUT-LLM achieves 1.66x lower latency than AMD MI210 and 1.72x higher energy efficiency than NVIDIA A100, scaling to 32B models with 2.16x efficiency gain over A100.
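To make the core idea concrete, the NumPy sketch below illustrates activation-weight co-quantization for a single matrix-vector product: both weights and activations are mapped to small codebooks of sub-vector centroids, dot products between every centroid pair are precomputed into a 2D table, and the online matmul reduces to a centroid search plus table lookups. This is only a toy illustration under assumed settings (group size 4, 16 centroids, random calibration data), not the paper's actual FPGA kernel or quantization procedure.

```python
import numpy as np

GROUP = 4      # sub-vector length shared by activation and weight groups (illustrative)
N_CENT = 16    # centroids per codebook, i.e. 4-bit indices (illustrative)

def kmeans_codebook(samples, k=N_CENT, iters=15):
    """Toy k-means over sub-vectors; stands in for offline codebook training."""
    cent = samples[np.random.choice(len(samples), k, replace=False)].copy()
    for _ in range(iters):
        idx = ((samples[:, None, :] - cent[None]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            if (idx == c).any():
                cent[c] = samples[idx == c].mean(0)
    return cent

def nearest(samples, cent):
    """Centroid search: map each sub-vector to its nearest codebook entry."""
    return ((samples[:, None, :] - cent[None]) ** 2).sum(-1).argmin(1)

rng = np.random.default_rng(0)
d_in, d_out = 64, 32
W = rng.standard_normal((d_out, d_in)).astype(np.float32)
x = rng.standard_normal(d_in).astype(np.float32)

# Offline: quantize the weights and train an activation codebook
# (random data here stands in for real calibration activations).
w_cent = kmeans_codebook(W.reshape(-1, GROUP))
w_idx = nearest(W.reshape(-1, GROUP), w_cent).reshape(d_out, d_in // GROUP)
a_cent = kmeans_codebook(rng.standard_normal((4096, GROUP)).astype(np.float32))

# Offline: 2D table of dot products between every (activation, weight) centroid pair.
table = a_cent @ w_cent.T                      # shape (N_CENT, N_CENT)

# Online: one centroid search per activation group, then the matmul becomes lookups.
a_idx = nearest(x.reshape(-1, GROUP), a_cent)  # (d_in // GROUP,)
y_lut = table[a_idx[None, :], w_idx].sum(-1)   # approximates W @ x

print("mean abs error vs. exact W @ x:", np.abs(y_lut - W @ x).mean())
```

In the accelerator setting the abstract describes, the precomputed 2D table would reside in on-chip FPGA memory, so the inner loop of inference performs memory reads rather than multiply-accumulates; the bandwidth-aware parallel centroid search and the spatial-temporal hybrid dataflow mentioned above address how those lookups are fed and scheduled, which this sketch does not model.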