A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency

Abstract

Large language models (LLMs) are widely used in chatbots, code generators, and search engines. Workloads such as chain-of-thought, complex reasoning, and agent services invoke the model repeatedly, which significantly increases inference cost. Optimization methods such as parallelism, compression, and caching have been adopted to reduce this cost, but diverse service requirements make it hard to select the right method. Recently, specialized LLM inference engines have emerged as a key component for integrating these optimization methods into service-oriented infrastructures. However, a systematic study of inference engines is still lacking. This paper provides a comprehensive evaluation of 25 open-source and commercial inference engines. We examine each inference engine in terms of ease of use, ease of deployment, general-purpose support, scalability, and suitability for throughput- and latency-aware computation. Furthermore, we explore the design goals of each inference engine by investigating the optimization techniques it supports. In addition, we assess the ecosystem maturity of open-source inference engines and discuss the performance and cost policies of commercial solutions. We outline future research directions, including support for complex LLM-based services, support for diverse hardware, and enhanced security, offering practical guidance to researchers and developers in selecting and designing optimized LLM inference engines. We also provide a public repository to continually track developments in this fast-evolving field: https://github.com/sihyeong/Awesome-LLM-Inference-Engine
