PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters

Abstract

The emergence of DeepSeek R1 and QwQ 32B has broken through performance barriers for running frontier large language models (LLMs) on home devices. While consumer hardware is getting stronger and model quantization is improving, existing on-device solutions still demand GPU clusters, large RAM/VRAM, and high bandwidth, far beyond what a common home cluster can handle. This paper introduces prima.cpp, a distributed inference system that runs 70B-scale models on everyday home devices using a mix of CPU and GPU, low RAM/VRAM, Wi-Fi, and cross-platform support. It uses mmap to manage model weights and introduces piped-ring parallelism with prefetching to hide disk loading. By modeling heterogeneity in computation, communication, disk, memory (and its management behavior), and OS, it optimally assigns model layers to each device's CPU and GPU, further reducing token latency. An elegant algorithm named Halda is proposed to solve this NP-hard assignment problem. We evaluate prima.cpp on a common four-node home cluster. It outperforms llama.cpp, exo, and dllama on 30B+ models while keeping memory pressure below 6%. This brings frontier 30B-70B models, such as Llama 3, DeepSeek R1, Qwen 2.5, and QwQ, to home assistants, making advanced AI truly accessible to individuals. The code is open source and available at https://github.com/Lizonghang/prima.cpp.
