LUT-LLM: Efficient Large Language Model Inference with Memory-based Computations on FPGAs

Abstract

The rapid progress of large language models (LLMs) has advanced numerous applications, yet efficient single-batch inference remains vital for on-device intelligence. While FPGAs offer fine-grained data control and high energy efficiency, recent GPU optimizations have narrowed their advantage, especially under arithmetic-based computation. To overcome this, we leverage FPGAs' abundant on-chip memory to shift LLM inference from arithmetic-based to memory-based computation through table lookups. We present LUT-LLM, the first FPGA accelerator enabling inference for LLMs with more than 1B parameters via vector-quantized memory operations. Our analysis identifies activation-weight co-quantization as the most effective scheme, and the accelerator supports it with (1) bandwidth-aware parallel centroid search, (2) efficient 2D table lookups, and (3) a spatial-temporal hybrid design that minimizes data caching. Implemented on an AMD V80 FPGA for a customized Qwen 3 1.7B model, LUT-LLM achieves 1.66x lower latency than the AMD MI210 and 1.72x higher energy efficiency than the NVIDIA A100, and it scales to 32B models with a 2.16x efficiency gain over the A100.
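
To make the idea of replacing arithmetic with table lookups concrete, the following is a minimal software sketch of a dot product under activation-weight co-quantization. It is a toy illustration only, not the accelerator design: the codebooks, the sub-vector length, the squared-Euclidean centroid search, and all function names are assumptions chosen for readability, and the weight indices are assumed to have been computed offline.

```python
# Toy sketch (hypothetical names and hand-picked codebooks, not the paper's design).

def dot(u, v):
    """Plain dot product, used offline to fill the table and for reference."""
    return sum(a * b for a, b in zip(u, v))

def nearest_centroid(vec, codebook):
    """Centroid search: index of the codebook entry closest to vec (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, centroid in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, centroid))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def build_table(act_codebook, weight_codebook):
    """2D table: entry [i][j] is dot(activation centroid i, weight centroid j)."""
    return [[dot(a, w) for w in weight_codebook] for a in act_codebook]

def split(vec, sub_len):
    """Cut a vector into consecutive sub-vectors of length sub_len."""
    return [vec[i:i + sub_len] for i in range(0, len(vec), sub_len)]

def lut_dot(activation, weight_indices, act_codebook, table, sub_len):
    """Approximate a dot product with one centroid search and one lookup per sub-vector."""
    total = 0.0
    for sub_vec, w_idx in zip(split(activation, sub_len), weight_indices):
        a_idx = nearest_centroid(sub_vec, act_codebook)
        total += table[a_idx][w_idx]  # lookup replaces the multiply-accumulate
    return total

SUB_LEN = 2
ACT_CODEBOOK = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # assumed activation centroids
WEIGHT_CODEBOOK = [[0.5, 0.5], [2.0, -1.0]]           # assumed weight centroids
TABLE = build_table(ACT_CODEBOOK, WEIGHT_CODEBOOK)    # precomputed once

activation = [0.9, 0.1, 1.1, 0.8]
weight_indices = [0, 1]  # weights [0.5, 0.5, 2.0, -1.0] encoded as centroid indices

approx = lut_dot(activation, weight_indices, ACT_CODEBOOK, TABLE, SUB_LEN)
exact = dot(activation, [0.5, 0.5, 2.0, -1.0])
print(approx, exact)
```

In this toy case the lookup path returns 1.5 while the exact dot product is about 1.9. The gap is the quantization error from snapping each activation sub-vector to its nearest centroid, which is the accuracy cost traded for removing multiplications at inference time.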
