Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference

Abstract

Modern large language models (LLMs) increasingly depend on efficient long-context processing and generation mechanisms, including sparse attention, retrieval-augmented generation (RAG), and compressed contextual memory, to support complex reasoning. We show that these optimizations can be unified into a four-step memory processing pipeline: Prepare Memory, Compute Relevancy, Retrieval, and Apply to Inference. Through systematic profiling, we find that memory processing accounts for 22% to 97% of LLM inference overhead and that its computational characteristics are strongly heterogeneous. Motivated by this insight, we argue that heterogeneous systems are well suited to accelerating memory processing and thus end-to-end inference. We demonstrate this approach on a GPU-FPGA system by offloading sparse, irregular, and memory-bound operations to FPGAs while retaining compute-intensive operations on GPUs. Evaluated on an AMD MI210 GPU and an Alveo U55C FPGA, our system is 1.04–2.2× faster and requires 1.11–4.7× less energy than the GPU baseline across multiple LLM inference optimizations (similar results hold on an NVIDIA A100). These results establish heterogeneous systems as a practical direction for efficient LLM memory processing and inform future heterogeneous hardware design.
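
To make the four-step decomposition concrete, the following minimal sketch maps top-k sparse attention onto the pipeline stages named above. It is an illustration only: the function names, the choice of top-k selection, and the pure-Python data layout are assumptions made here, not the implementation evaluated in the paper, and no GPU or FPGA offloading is modeled.

```python
import math

def prepare_memory(keys, values):
    """Step 1 (Prepare Memory): pair each key with its value as a memory entry."""
    return list(zip(keys, values))

def compute_relevancy(query, memory):
    """Step 2 (Compute Relevancy): scaled dot-product score for every entry."""
    scale = 1.0 / math.sqrt(len(query))
    return [scale * sum(q * k for q, k in zip(query, key)) for key, _ in memory]

def retrieve(scores, memory, top_k):
    """Step 3 (Retrieval): keep only the top_k highest-scoring entries."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    return [scores[i] for i in ranked], [memory[i][1] for i in ranked]

def apply_to_inference(sel_scores, sel_values):
    """Step 4 (Apply to Inference): softmax-weighted sum of the selected values."""
    peak = max(sel_scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in sel_scores]
    total = sum(weights)
    dim = len(sel_values[0])
    return [sum(w * v[d] for w, v in zip(weights, sel_values)) / total
            for d in range(dim)]

def sparse_attention(query, keys, values, top_k):
    """Hypothetical helper that runs the four steps in order for one query."""
    memory = prepare_memory(keys, values)
    scores = compute_relevancy(query, memory)
    sel_scores, sel_values = retrieve(scores, memory, top_k)
    return apply_to_inference(sel_scores, sel_values)
```

In this sketch, steps 2 and 3 touch every memory entry but do little arithmetic per entry, while step 4 operates only on the small selected subset. This illustrates one way in which stages of the same pipeline can differ in their computational characteristics.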
