CodeV-R1: Reasoning-Enhanced Verilog Generation

Abstract

Large language models (LLMs) trained via reinforcement learning with verifiable reward (RLVR) have achieved breakthroughs on tasks with explicit, automatable verification, such as software programming and mathematical problem solving. Extending RLVR to electronic design automation (EDA), in particular to automatically generating hardware description languages (HDLs) such as Verilog from natural-language (NL) specifications, poses three key challenges: the lack of automated and accurate verification environments, the scarcity of high-quality NL-code pairs, and the prohibitive computational cost of RLVR. To address these challenges, we introduce CodeV-R1, an RLVR framework for training Verilog generation LLMs. First, we develop a rule-based testbench generator that performs robust equivalence checking against golden references. Second, we propose a round-trip data synthesis method that pairs open-source Verilog snippets with LLM-generated NL descriptions, verifies code-NL-code consistency via the generated testbench, and filters out inequivalent examples to yield a high-quality dataset. Third, we employ a two-stage "distill-then-RL" training pipeline: distillation for the cold start of reasoning abilities, followed by adaptive DAPO, our novel RLVR algorithm that reduces training cost by adaptively adjusting the sampling rate. The resulting model, CodeV-R1-7B, achieves 68.6% and 72.9% pass@1 on VerilogEval v2 and RTLLM v1.1, respectively, surpassing the prior state of the art by 12% to 20% while matching or even exceeding the performance of the 671B DeepSeek-R1. We will release our model, training pipeline, and dataset to facilitate research in the EDA and LLM communities.
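The round-trip filtering idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: designs are modeled as plain Python functions standing in for simulated combinational Verilog modules, equivalence is approximated by random stimulus rather than by the paper's rule-based testbench generator, and the names `random_stimuli`, `equivalent`, and `round_trip_filter` are hypothetical.

```python
import random
from typing import Callable, Dict, List, Tuple

# Assumption: a "design" is a pure function from named integer inputs to
# named integer outputs, standing in for a simulated combinational module.
Design = Callable[[Dict[str, int]], Dict[str, int]]


def random_stimuli(widths: Dict[str, int], count: int, seed: int = 0) -> List[Dict[str, int]]:
    """Draw `count` random input vectors, one value per port within its bit width."""
    rng = random.Random(seed)
    return [
        {name: rng.getrandbits(width) for name, width in widths.items()}
        for _ in range(count)
    ]


def equivalent(golden: Design, candidate: Design, widths: Dict[str, int], count: int = 256) -> bool:
    """Return True if both designs agree on every sampled input vector.

    Random testing can miss rare mismatches, so this is only an approximation
    of equivalence checking.
    """
    return all(golden(vec) == candidate(vec) for vec in random_stimuli(widths, count))


def round_trip_filter(
    triples: List[Tuple[str, Design, Design]],
    widths: Dict[str, int],
    count: int = 256,
) -> List[Tuple[str, Design]]:
    """Keep (description, golden) pairs whose regenerated design matches the golden one."""
    return [
        (description, golden)
        for description, golden, regenerated in triples
        if equivalent(golden, regenerated, widths, count)
    ]


# Hypothetical example data: a 4-bit adder and two "regenerated" candidates.
def golden_adder(v: Dict[str, int]) -> Dict[str, int]:
    return {"sum": (v["a"] + v["b"]) & 0xF}


def good_candidate(v: Dict[str, int]) -> Dict[str, int]:
    return {"sum": (v["a"] + v["b"]) % 16}


def bad_candidate(v: Dict[str, int]) -> Dict[str, int]:
    return {"sum": v["a"] | v["b"]}  # wrong whenever a and b share a set bit


PORT_WIDTHS = {"a": 4, "b": 4}
TRIPLES = [
    ("4-bit adder, consistent description", golden_adder, good_candidate),
    ("4-bit adder, misleading description", golden_adder, bad_candidate),
]
KEPT = round_trip_filter(TRIPLES, PORT_WIDTHS)
```

In this sketch only the first triple survives, because the candidate regenerated from the misleading description disagrees with the golden design on sampled inputs. A real flow would simulate the Verilog itself and would also need to drive clocks and resets for sequential circuits, which this toy model omits.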
