V-Seek: Accelerating LLM Reasoning on Open-hardware Server-class RISC-V Platforms

Abstract

The recent exponential growth of Large Language Models (LLMs) has relied on GPU-based systems. However, CPUs are emerging as a flexible and lower-cost alternative, especially for inference and reasoning workloads. RISC-V is rapidly gaining traction in this area thanks to its open and vendor-neutral Instruction Set Architecture (ISA). Because LLM workloads require domain-specific tuning, however, RISC-V hardware and the corresponding software ecosystem are not yet fully mature or streamlined. This paper aims to fill this gap by optimizing LLM inference on the Sophon SG2042, the first commercially available many-core RISC-V CPU with vector processing capabilities. On two recent state-of-the-art LLMs optimized for reasoning, DeepSeek R1 Distill Llama 8B and DeepSeek R1 Distill QWEN 14B, we achieve 4.32/2.29 tokens/s for token generation and 6.54/3.68 tokens/s for prompt processing, a speedup of up to 2.9x/3.0x over our baseline.
