LUT-LLM: Efficient Large Language Model Inference with Memory-based Computations on FPGAs
November 9, 2025
Authors: Zifan He, Shengyu Ye, Rui Ma, Yang Wang, Jason Cong
cs.AI
Abstract
The rapid progress of large language models (LLMs) has advanced numerous
applications, yet efficient single-batch inference remains vital for on-device
intelligence. While FPGAs offer fine-grained data control and high energy
efficiency, recent GPU optimizations have narrowed their advantage, especially
under arithmetic-based computation. To overcome this, we leverage FPGAs'
abundant on-chip memory to shift LLM inference from arithmetic- to memory-based
computation through table lookups. We present LUT-LLM, the first FPGA
accelerator enabling 1B+ LLM inference via vector-quantized memory operations.
Our analysis identifies activation-weight co-quantization as the most effective
scheme, supported by (1) bandwidth-aware parallel centroid search, (2)
efficient 2D table lookups, and (3) a spatial-temporal hybrid design minimizing
data caching. Implemented on an AMD V80 FPGA for a customized Qwen 3 1.7B
model, LUT-LLM achieves 1.66x lower latency than an AMD MI210 GPU and 1.72x
higher energy efficiency than an NVIDIA A100 GPU, and it scales to 32B models
with a 2.16x energy-efficiency gain over the A100.
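
To make the table-lookup idea concrete, below is a minimal NumPy sketch, written
for this summary rather than taken from the paper, of how a matrix-vector
product can be replaced by memory lookups once the weights are vector-quantized:
each weight row is split into length-G sub-vectors, each sub-vector is mapped
offline to one of K centroids, and at inference time a small 2D table of
activation-centroid dot products is built so the matmul reduces to gathers and
adds. All names, shapes, and the toy random-centroid quantizer are illustrative
assumptions (a real system would train the codebook, e.g. with k-means).

```python
# Minimal sketch (illustrative, not the paper's implementation) of
# table-lookup-based matrix-vector multiplication via weight vector quantization.
import numpy as np

rng = np.random.default_rng(0)
D_in, D_out, G, K = 64, 32, 4, 16          # dims, sub-vector length, codebook size
n_groups = D_in // G

W = rng.standard_normal((D_out, D_in)).astype(np.float32)
x = rng.standard_normal(D_in).astype(np.float32)

# --- Offline: vector-quantize the weights ---
# Toy quantizer: random centroids stand in for a k-means-trained codebook.
subvecs = W.reshape(D_out, n_groups, G)               # split rows into sub-vectors
centroids = rng.standard_normal((K, G)).astype(np.float32)
# Nearest-centroid assignment per sub-vector (brute-force L2 search).
dists = ((subvecs[:, :, None, :] - centroids[None, None, :, :]) ** 2).sum(-1)
codes = dists.argmin(-1)                              # (D_out, n_groups) indices

# --- Online: replace multiplies with table lookups ---
# Precompute one table row per activation slice: table[g, k] = <x_g, centroid_k>.
x_groups = x.reshape(n_groups, G)
table = x_groups @ centroids.T                        # (n_groups, K) dot products
# The matmul is now pure gather-and-add over a 2D table (group index, code index).
y_lut = table[np.arange(n_groups)[None, :], codes].sum(-1)

y_ref = W @ x                                         # exact result for comparison
print("max abs error:", np.abs(y_lut - y_ref).max())  # nonzero: quantization loss
```

On an FPGA, such a table can reside in on-chip BRAM/URAM, so each multiply-accumulate
becomes a lookup-and-add; this is the shift from arithmetic- to memory-based
computation that the abstract describes.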