Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models

Abstract

Large Language Models (LLMs) based on the Mixture-of-Experts (MoE) architecture are showing promising performance on various tasks. However, running them in resource-constrained settings, where GPU memory is not abundant, is challenging because of their huge model sizes. Existing systems that offload model weights to CPU memory suffer from the significant overhead of frequently moving data between the CPU and GPU. In this paper, we propose Fiddler, a resource-efficient inference engine with CPU-GPU orchestration for MoE models. The key idea of Fiddler is to use the computation ability of the CPU to minimize data movement between the CPU and GPU. Our evaluation shows that Fiddler can run the uncompressed Mixtral-8x7B model, whose parameters exceed 90GB, at over 3 tokens per second on a single GPU with 24GB of memory, an order of magnitude improvement over existing methods. The code of Fiddler is publicly available at https://github.com/efeslab/fiddler.
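The key idea can be pictured as a per-expert placement decision: when an expert's weights live in CPU memory, either copy them to the GPU and compute there, or compute on the CPU and skip the copy. The following is a minimal Python sketch of that trade-off under stated assumptions, not Fiddler's actual implementation. The `CostModel` class, the `choose_device` function, and every numeric value are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    """Illustrative latency model; every field value is a placeholder assumption."""
    expert_bytes: float      # size of one expert's weights, in bytes
    pcie_bytes_per_s: float  # assumed CPU-to-GPU transfer bandwidth
    cpu_s_per_token: float   # assumed CPU time to run one expert on one token
    gpu_s_per_batch: float   # assumed GPU time to run one expert on a batch

    def transfer_s(self) -> float:
        # Time to copy one expert's weights from CPU memory to GPU memory.
        return self.expert_bytes / self.pcie_bytes_per_s

    def cpu_latency(self, n_tokens: int) -> float:
        # Assumes CPU time grows linearly with the number of tokens.
        return self.cpu_s_per_token * n_tokens

    def gpu_latency(self, on_gpu: bool) -> float:
        # Pay the weight copy only when the expert is not already on the GPU.
        copy_s = 0.0 if on_gpu else self.transfer_s()
        return copy_s + self.gpu_s_per_batch


def choose_device(model: CostModel, n_tokens: int, on_gpu: bool) -> str:
    """Pick where to run one expert for the tokens routed to it."""
    if on_gpu:
        return "gpu"  # weights already resident, nothing to move
    if model.cpu_latency(n_tokens) <= model.gpu_latency(on_gpu=False):
        return "cpu"  # computing in place is cheaper than moving the weights
    return "gpu"


# Hypothetical numbers: a 350 MB expert, a 25 GB/s link, 2 ms per token on CPU.
example = CostModel(
    expert_bytes=350e6,
    pcie_bytes_per_s=25e9,
    cpu_s_per_token=0.002,
    gpu_s_per_batch=0.001,
)

print(choose_device(example, n_tokens=1, on_gpu=False))   # few tokens
print(choose_device(example, n_tokens=64, on_gpu=False))  # many tokens
```

With these placeholder values, a single token is cheaper to process on the CPU than to wait for the weight copy, while a large batch amortizes the copy and favors the GPU. Real crossover points depend on measured hardware latencies.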
