Taming the Titans: A Survey of Efficient LLM Inference Serving

Abstract

Large Language Models (LLMs) for Generative AI have achieved remarkable progress, evolving into sophisticated and versatile tools widely adopted across various domains and applications. However, the substantial memory overhead caused by their vast number of parameters, combined with the high computational demands of the attention mechanism, makes it difficult to achieve low latency and high throughput in LLM inference services. Recent research has significantly accelerated progress on these problems. This paper provides a comprehensive survey of these methods, covering fundamental instance-level approaches, in-depth cluster-level strategies, emerging scenario directions, and other miscellaneous but important areas. At the instance level, we review model placement, request scheduling, decoding length prediction, storage management, and the disaggregation paradigm. At the cluster level, we explore GPU cluster deployment, multi-instance load balancing, and cloud service solutions. For emerging scenarios, we organize the discussion around specific tasks, modules, and auxiliary methods. To ensure a holistic overview, we also highlight several niche yet critical areas. Finally, we outline potential research directions to further advance the field of LLM inference serving.
