PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters

Abstract

The emergence of DeepSeek R1 and QwQ 32B has broken through performance barriers for running frontier large language models (LLMs) on home devices. While consumer hardware is getting stronger and model quantization is improving, existing end-side solutions still demand GPU clusters, large RAM/VRAM, and high bandwidth, far beyond what a common home cluster can handle. This paper introduces prima.cpp, a distributed inference system that runs 70B-scale models on everyday home devices using a mix of CPU/GPU, low RAM/VRAM, Wi-Fi, and cross-platform support. It uses mmap to manage model weights and introduces piped-ring parallelism with prefetching to hide disk loading. By modeling heterogeneity in computation, communication, disk, memory (and its management behavior), and OS, it optimally assigns model layers to each device's CPU and GPU, further reducing token latency. An elegant algorithm named Halda is proposed to solve this NP-hard assignment problem. We evaluate prima.cpp on a common four-node home cluster. It outperforms llama.cpp, exo, and dllama on 30B+ models while keeping memory pressure below 6%. This brings frontier 30B-70B models, such as Llama 3, DeepSeek R1, Qwen 2.5, and QwQ, to home assistants, making advanced AI truly accessible to individuals. The code is open source and available at https://github.com/Lizonghang/prima.cpp.
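As a rough illustration of the mmap idea mentioned in the abstract, the sketch below memory-maps a toy weight file and decodes only the layers that are requested, so the operating system pages data in from disk on demand rather than loading the whole file up front. This is not prima.cpp's implementation: the file layout, the constants, and the helper names (write_toy_weights, read_layer, load_layers, demo) are all hypothetical and chosen only for this example.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical toy layout (an assumption for this sketch, not prima.cpp's format):
# the file holds NUM_LAYERS layers, each stored as LAYER_FLOATS little-endian float32 values.
LAYER_FLOATS = 4
NUM_LAYERS = 3
LAYER_BYTES = LAYER_FLOATS * 4


def write_toy_weights(path):
    """Write a toy weight file in which layer i holds LAYER_FLOATS copies of float(i)."""
    with open(path, "wb") as f:
        for layer in range(NUM_LAYERS):
            values = [float(layer)] * LAYER_FLOATS
            f.write(struct.pack(f"<{LAYER_FLOATS}f", *values))


def read_layer(mapped, layer):
    """Decode one layer straight from the mapping; only the pages touched must be resident."""
    offset = layer * LAYER_BYTES
    return list(struct.unpack_from(f"<{LAYER_FLOATS}f", mapped, offset))


def load_layers(path, layers):
    """Map the file read-only and decode just the requested layers."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return {layer: read_layer(mapped, layer) for layer in layers}


def demo():
    """Create a temporary toy file, read layers 2 and 0 through mmap, then clean up."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_toy_weights(path)
        return load_layers(path, [2, 0])
    finally:
        os.remove(path)
```

Calling demo() returns a dictionary keyed by layer index. Because the mapping is read-only and file-backed, the operating system is free to evict pages it has already served under memory pressure and reload them from disk later, which is the general property that makes mmap attractive on devices with little RAM.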
