V-Seek: Accelerating LLM Reasoning on Open-hardware Server-class RISC-V Platforms
March 21, 2025
Authors: Javier J. Poveda Rodrigo, Mohamed Amine Ahmdi, Alessio Burrello, Daniele Jahier Pagliari, Luca Benini
cs.AI
Abstract
The recent exponential growth of Large Language Models (LLMs) has relied on
GPU-based systems. However, CPUs are emerging as a flexible and lower-cost
alternative, especially when targeting inference and reasoning workloads.
RISC-V is rapidly gaining traction in this area, given its open and
vendor-neutral ISA. However, the RISC-V hardware for LLM workloads and the
corresponding software ecosystem are not yet fully mature and streamlined,
given the need for domain-specific tuning. This paper aims to fill this gap,
focusing on optimizing LLM inference on the Sophon SG2042, the first
commercially available many-core RISC-V CPU with vector processing
capabilities.
On two recent state-of-the-art LLMs optimized for reasoning, DeepSeek R1
Distill Llama 8B and DeepSeek R1 Distill QWEN 14B, we achieve 4.32/2.29 token/s
for token generation and 6.54/3.68 token/s for prompt processing, with a
speed-up of up to 2.9x/3.0x compared to our baseline.
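
Illustrative note: the hot loop of CPU-side LLM inference is the matrix-vector
product, whose inner dot product maps naturally onto strip-mined RISC-V Vector
(RVV) loops. The sketch below is a hypothetical illustration using RVV 1.0 C
intrinsics, not the paper's actual kernel; the SG2042's cores implement an
earlier draft of the vector extension (RVV 0.7.1), so a real port would target
a matching toolchain.

/* Minimal sketch: f32 dot product with RVV 1.0 intrinsics. Assumption: a
 * toolchain providing <riscv_vector.h> (e.g. compiled with -march=rv64gcv).
 * NOT the paper's kernel; for illustration of the vectorization idiom only. */
#include <riscv_vector.h>
#include <stddef.h>

float dot_f32(const float *a, const float *b, size_t n) {
    size_t vlmax = __riscv_vsetvlmax_e32m8();
    /* Vector of running partial sums; zero every lane once up front. */
    vfloat32m8_t acc = __riscv_vfmv_v_f_f32m8(0.0f, vlmax);
    for (size_t i = 0; i < n; ) {
        size_t vl = __riscv_vsetvl_e32m8(n - i);   /* strip-mine the remainder */
        vfloat32m8_t va = __riscv_vle32_v_f32m8(a + i, vl);
        vfloat32m8_t vb = __riscv_vle32_v_f32m8(b + i, vl);
        /* Tail-undisturbed FMA so lanes past vl keep their partial sums. */
        acc = __riscv_vfmacc_vv_f32m8_tu(acc, va, vb, vl);
        i += vl;
    }
    /* Horizontal reduction of all accumulator lanes to a single scalar. */
    vfloat32m1_t zero = __riscv_vfmv_v_f_f32m1(0.0f, 1);
    vfloat32m1_t sum  = __riscv_vfredusum_vs_f32m8_f32m1(acc, zero, vlmax);
    return __riscv_vfmv_f_s_f32m1_f32(sum);
}

The strip-mined form (re-issuing vsetvl inside the loop) handles any n without
a scalar epilogue, which is the vector-length-agnostic idiom RVV code is built
around.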