V-Seek: Accelerating LLM Reasoning on Open-hardware Server-class RISC-V Platforms

Abstract

The recent exponential growth of Large Language Models (LLMs) has relied on GPU-based systems. However, CPUs are emerging as a flexible and lower-cost alternative, especially when targeting inference and reasoning workloads. RISC-V is rapidly gaining traction in this area, given its open and vendor-neutral ISA. However, the RISC-V hardware for LLM workloads and the corresponding software ecosystem are not yet fully mature and streamlined, because they require domain-specific tuning. This paper aims to fill this gap, focusing on optimizing LLM inference on the Sophon SG2042, the first commercially available many-core RISC-V CPU with vector processing capabilities. On two recent state-of-the-art LLMs optimized for reasoning, DeepSeek R1 Distill Llama 8B and DeepSeek R1 Distill QWEN 14B, we achieve 4.32/2.29 tokens/s for token generation and 6.54/3.68 tokens/s for prompt processing, with a speedup of up to 2.9x/3.0x compared to our baseline.
