V-Seek: Accelerating LLM Reasoning on Open-hardware Server-class RISC-V Platforms
March 21, 2025
Authors: Javier J. Poveda Rodrigo, Mohamed Amine Ahmdi, Alessio Burrello, Daniele Jahier Pagliari, Luca Benini
cs.AI
Abstract
The recent exponential growth of Large Language Models (LLMs) has relied on
GPU-based systems. However, CPUs are emerging as a flexible and lower-cost
alternative, especially when targeting inference and reasoning workloads.
RISC-V is rapidly gaining traction in this area, given its open and
vendor-neutral ISA. Nonetheless, RISC-V hardware for LLM workloads and the
corresponding software ecosystem are not yet fully mature and streamlined,
given the need for domain-specific tuning. This paper aims to fill this gap,
focusing on optimizing LLM inference on the Sophon SG2042, the first
commercially available many-core RISC-V CPU with vector processing
capabilities.
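
The SG2042 is built around 64 T-Head C920 cores whose vector units implement a pre-ratification draft of the RISC-V vector extension (RVV 0.7.1), so the paper's kernels presumably target that spec through a vendor toolchain. As a purely illustrative sketch (our code, not the paper's), here is what a vectorized dot product, the inner loop of the matrix-vector products that dominate token generation, looks like with the ratified RVV 1.0 intrinsics; the function name and the LMUL=8 register grouping are our choices:

```c
#include <riscv_vector.h>
#include <stddef.h>

/* Illustrative RVV dot product (hypothetical, not the paper's kernel).
 * Build with e.g. clang -march=rv64gcv -O3. */
float dot_rvv(const float *a, const float *b, size_t n) {
    size_t vlmax = __riscv_vsetvlmax_e32m8();
    /* Vector accumulator: one partial sum per lane. */
    vfloat32m8_t acc = __riscv_vfmv_v_f_f32m8(0.0f, vlmax);
    for (size_t i = 0; i < n; ) {
        size_t vl = __riscv_vsetvl_e32m8(n - i);   /* strip-mine the tail */
        vfloat32m8_t va = __riscv_vle32_v_f32m8(a + i, vl);
        vfloat32m8_t vb = __riscv_vle32_v_f32m8(b + i, vl);
        /* Tail-undisturbed FMA so a short final iteration leaves the
         * lanes beyond vl of the accumulator intact. */
        acc = __riscv_vfmacc_vv_f32m8_tu(acc, va, vb, vl);
        i += vl;
    }
    /* Horizontal reduction of the lane-wise partial sums to a scalar. */
    vfloat32m1_t zero = __riscv_vfmv_s_f_f32m1(0.0f, 1);
    vfloat32m1_t sum  = __riscv_vfredusum_vs_f32m8_f32m1(acc, zero, vlmax);
    return __riscv_vfmv_f_s_f32m1_f32(sum);
}
```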
On two recent state-of-the-art LLMs optimized for reasoning, DeepSeek R1
Distill Llama 8B and DeepSeek R1 Distill QWEN 14B, we achieve 4.32/2.29 token/s
for token generation and 6.54/3.68 token/s for prompt processing, with a
speedup of up to 2.9x/3.0x compared to our baseline.
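
For context, the reported speedups pin down the implied unoptimized baseline. Reading the slash-separated pairs as Llama 8B / QWEN 14B (consistent with the throughput figures) and assuming the 2.9x/3.0x apply to token generation, a trivial check:

```c
#include <stdio.h>

/* Back-of-envelope arithmetic on the abstract's figures (our reading,
 * not the paper's tables): recover the implied baseline throughput. */
int main(void) {
    const char  *model[]   = {"DeepSeek R1 Distill Llama 8B",
                              "DeepSeek R1 Distill QWEN 14B"};
    const double gen_tps[] = {4.32, 2.29};  /* optimized generation, token/s */
    const double speedup[] = {2.9, 3.0};    /* reported best-case speedup */
    for (int i = 0; i < 2; i++)
        printf("%s: implied baseline ~%.2f token/s\n",
               model[i], gen_tps[i] / speedup[i]);
    return 0;
}
```

This puts the baseline near 1.5 and 0.76 token/s respectively, which illustrates how far domain-specific tuning moves an out-of-the-box software stack on this platform.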