
LUT-LLM: Efficient Large Language Model Inference with Memory-based Computations on FPGAs

November 9, 2025
Authors: Zifan He, Shengyu Ye, Rui Ma, Yang Wang, Jason Cong
cs.AI

Abstract

The rapid progress of large language models (LLMs) has advanced numerous applications, yet efficient single-batch inference remains vital for on-device intelligence. While FPGAs offer fine-grained data control and high energy efficiency, recent GPU optimizations have narrowed their advantage, especially under arithmetic-based computation. To overcome this, we leverage FPGAs' abundant on-chip memory to shift LLM inference from arithmetic- to memory-based computation through table lookups. We present LUT-LLM, the first FPGA accelerator enabling 1B+ LLM inference via vector-quantized memory operations. Our analysis identifies activation-weight co-quantization as the most effective scheme, supported by (1) bandwidth-aware parallel centroid search, (2) efficient 2D table lookups, and (3) a spatial-temporal hybrid design minimizing data caching. Implemented on an AMD V80 FPGA for a customized Qwen 3 1.7B model, LUT-LLM achieves 1.66x lower latency than AMD MI210 and 1.72x higher energy efficiency than NVIDIA A100, scaling to 32B models with 2.16x efficiency gain over A100.
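To make the core idea concrete, the NumPy sketch below illustrates activation-weight co-quantization for a single matrix-vector product: both weights and activations are mapped to small codebooks of sub-vector centroids, dot products between every centroid pair are precomputed into a 2D table, and the online matmul reduces to a centroid search plus table lookups. This is only a toy illustration under assumed settings (group size 4, 16 centroids, random calibration data), not the paper's actual FPGA kernel or quantization procedure.

```python
import numpy as np

GROUP = 4      # sub-vector length shared by activation and weight groups (illustrative)
N_CENT = 16    # centroids per codebook, i.e. 4-bit indices (illustrative)

def kmeans_codebook(samples, k=N_CENT, iters=15):
    """Toy k-means over sub-vectors; stands in for offline codebook training."""
    cent = samples[np.random.choice(len(samples), k, replace=False)].copy()
    for _ in range(iters):
        idx = ((samples[:, None, :] - cent[None]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            if (idx == c).any():
                cent[c] = samples[idx == c].mean(0)
    return cent

def nearest(samples, cent):
    """Centroid search: map each sub-vector to its nearest codebook entry."""
    return ((samples[:, None, :] - cent[None]) ** 2).sum(-1).argmin(1)

rng = np.random.default_rng(0)
d_in, d_out = 64, 32
W = rng.standard_normal((d_out, d_in)).astype(np.float32)
x = rng.standard_normal(d_in).astype(np.float32)

# Offline: quantize the weights and train an activation codebook
# (random data here stands in for real calibration activations).
w_cent = kmeans_codebook(W.reshape(-1, GROUP))
w_idx = nearest(W.reshape(-1, GROUP), w_cent).reshape(d_out, d_in // GROUP)
a_cent = kmeans_codebook(rng.standard_normal((4096, GROUP)).astype(np.float32))

# Offline: 2D table of dot products between every (activation, weight) centroid pair.
table = a_cent @ w_cent.T                      # shape (N_CENT, N_CENT)

# Online: one centroid search per activation group, then the matmul becomes lookups.
a_idx = nearest(x.reshape(-1, GROUP), a_cent)  # (d_in // GROUP,)
y_lut = table[a_idx[None, :], w_idx].sum(-1)   # approximates W @ x

print("mean abs error vs. exact W @ x:", np.abs(y_lut - W @ x).mean())
```

In the accelerator setting the abstract describes, the precomputed 2D table would reside in on-chip FPGA memory, so the inner loop of inference performs memory reads rather than multiply-accumulates; the bandwidth-aware parallel centroid search and the spatial-temporal hybrid dataflow mentioned above address how those lookups are fed and scheduled, which this sketch does not model.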