Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference
March 30, 2026
Authors: Zifan He, Rui Ma, Yizhou Sun, Jason Cong
cs.AI
Abstract
Modern large language models (LLMs) increasingly depend on efficient long-context processing and generation mechanisms, including sparse attention, retrieval-augmented generation (RAG), and compressed contextual memory, to support complex reasoning. We show that these optimizations can be unified into a four-step memory processing pipeline: Prepare Memory, Compute Relevancy, Retrieve Memory, and Apply to Inference. Through systematic profiling, we identify a 22%-97% memory processing overhead in LLM inference and strong heterogeneity in its computational characteristics. Motivated by this insight, we argue that heterogeneous systems are well suited to accelerating memory processing and thus end-to-end inference. We demonstrate this approach on a GPU-FPGA system by offloading sparse, irregular, and memory-bound operations to FPGAs while retaining compute-intensive operations on GPUs. Evaluated on an AMD MI210 GPU and an Alveo U55C FPGA, our system is 1.04-2.2× faster and uses 1.11-4.7× less energy across multiple LLM inference optimizations than the GPU baseline (similar results hold on an NVIDIA A100). These results establish heterogeneous systems as a practical direction for efficient LLM memory processing and inform future heterogeneous hardware design.
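The four-step pipeline named in the abstract can be sketched as follows. This is a minimal illustrative mock-up, not the paper's implementation: all function names are hypothetical, and cosine similarity with top-k gathering stands in for whichever relevancy and retrieval operators a given optimization (sparse attention, RAG, compressed memory) actually uses. The top-k gather in step 3 is an example of the sparse, irregular, memory-bound work the paper offloads to the FPGA, while the dense scoring in step 2 is the kind of compute-intensive work kept on the GPU.

```python
import numpy as np

def prepare_memory(context_embeddings: np.ndarray) -> np.ndarray:
    """Step 1 (Prepare Memory): normalize context vectors into a memory pool."""
    norms = np.linalg.norm(context_embeddings, axis=1, keepdims=True)
    return context_embeddings / np.maximum(norms, 1e-8)

def compute_relevancy(memory: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Step 2 (Compute Relevancy): score every memory entry against the query
    (cosine similarity here; dense and compute-intensive, GPU-friendly)."""
    q = query / max(np.linalg.norm(query), 1e-8)
    return memory @ q

def retrieve_memory(memory: np.ndarray, scores: np.ndarray, k: int) -> np.ndarray:
    """Step 3 (Retrieve Memory): gather the top-k entries -- a sparse,
    irregular access pattern of the kind the paper maps to the FPGA."""
    idx = np.argsort(scores)[::-1][:k]
    return memory[idx]

def apply_to_inference(retrieved: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Step 4 (Apply to Inference): fold retrieved memory back into the model
    input; a simple average stands in for attention over retrieved entries."""
    return (retrieved.mean(axis=0) + query) / 2

# End-to-end walk through the pipeline on random data.
rng = np.random.default_rng(0)
memory = prepare_memory(rng.standard_normal((128, 64)))
query = rng.standard_normal(64)
scores = compute_relevancy(memory, query)
augmented = apply_to_inference(retrieve_memory(memory, scores, k=8), query)
print(augmented.shape)  # prints (64,)
```

Casting each optimization into these four stages is what lets the profiling attribute the 22%-97% overhead to memory processing and decide, per stage, whether GPU or FPGA is the better target.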