CodeV-R1: Reasoning-Enhanced Verilog Generation
May 30, 2025
Authors: Yaoyu Zhu, Di Huang, Hanqi Lyu, Xiaoyun Zhang, Chongxiao Li, Wenxuan Shi, Yutong Wu, Jianan Mu, Jinghua Wang, Yang Zhao, Pengwei Jin, Shuyao Cheng, Shengwen Liang, Xishan Zhang, Rui Zhang, Zidong Du, Qi Guo, Xing Hu, Yunji Chen
cs.AI
Abstract
Large language models (LLMs) trained via reinforcement learning with
verifiable reward (RLVR) have achieved breakthroughs on tasks with explicit,
automatable verification, such as software programming and mathematical
problems. Extending RLVR to electronic design automation (EDA), especially
automatically generating hardware description languages (HDLs) like Verilog
from natural-language (NL) specifications, however, poses three key challenges:
the lack of automated and accurate verification environments, the scarcity of
high-quality NL-code pairs, and the prohibitive computation cost of RLVR. To
this end, we introduce CodeV-R1, an RLVR framework for training Verilog
generation LLMs. First, we develop a rule-based testbench generator that
performs robust equivalence checking against golden references. Second, we
propose a round-trip data synthesis method that pairs open-source Verilog
snippets with LLM-generated NL descriptions, verifies code-NL-code consistency
via the generated testbench, and filters out inequivalent examples to yield a
high-quality dataset. Third, we employ a two-stage "distill-then-RL" training
pipeline: distillation for the cold start of reasoning abilities, followed by
adaptive DAPO, our novel RLVR algorithm that can reduce training cost by
adaptively adjusting the sampling rate. The resulting model, CodeV-R1-7B, achieves
68.6% and 72.9% pass@1 on VerilogEval v2 and RTLLM v1.1, respectively,
surpassing prior state-of-the-art by 12~20%, while matching or even exceeding
the performance of 671B DeepSeek-R1. We will release our model, training
pipeline, and dataset to facilitate research in the EDA and LLM communities.
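
The round-trip data synthesis and filtering step described in the abstract can be pictured as follows. This is an illustrative Python sketch, not the authors' released pipeline: it assumes Icarus Verilog (`iverilog`/`vvp`) as the simulator, and the helpers `llm_describe`, `llm_generate`, and `make_testbench` are hypothetical stand-ins for the LLM calls and the rule-based testbench generator.

```python
# Hypothetical sketch of round-trip NL-code filtering (not the paper's code).
# Requires Icarus Verilog (iverilog/vvp) on the PATH.
import subprocess
import tempfile
from pathlib import Path


def simulate(testbench: str, dut: str) -> str:
    """Compile a testbench with a DUT and return the simulator's stdout."""
    with tempfile.TemporaryDirectory() as tmp:
        tb_file = Path(tmp) / "tb.v"
        dut_file = Path(tmp) / "dut.v"
        tb_file.write_text(testbench)
        dut_file.write_text(dut)
        sim_bin = Path(tmp) / "sim.out"
        subprocess.run(
            ["iverilog", "-o", str(sim_bin), str(tb_file), str(dut_file)],
            check=True, capture_output=True,
        )
        run = subprocess.run(
            ["vvp", str(sim_bin)], check=True, capture_output=True, text=True
        )
        return run.stdout


def round_trip_keep(golden_verilog, make_testbench, llm_describe, llm_generate) -> bool:
    """Keep an NL-code pair only if code regenerated from the NL description
    matches the golden snippet's behavior on the generated testbench."""
    nl_spec = llm_describe(golden_verilog)       # code -> NL description
    regenerated = llm_generate(nl_spec)          # NL description -> code
    testbench = make_testbench(golden_verilog)   # rule-based stimulus/checker
    return simulate(testbench, golden_verilog) == simulate(testbench, regenerated)
```

In the framework described above, only pairs that pass this equivalence check enter the training set; the sketch simply captures the filtering criterion (identical testbench behavior for the golden and regenerated modules).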
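
The "adaptively adjusting the sampling rate" idea in adaptive DAPO can be read, loosely, as spending fewer rollouts on prompts that carry no learning signal under group-relative advantages. The sketch below is an assumption about that idea, not the paper's actual algorithm; `generate_and_score` is a hypothetical function that samples completions and returns binary testbench rewards.

```python
# Assumed illustration of adaptive rollout budgeting in an RLVR loop.
from typing import Callable, List


def adaptive_rollouts(
    prompt: str,
    generate_and_score: Callable[[str, int], List[float]],  # hypothetical: n rollouts -> 0/1 rewards
    probe_n: int = 4,
    full_n: int = 16,
) -> List[float]:
    """Probe a prompt with a few rollouts; only spend the full budget when the
    probe shows a mixed pass rate (i.e., the group advantage is nonzero)."""
    rewards = generate_and_score(prompt, probe_n)
    pass_rate = sum(rewards) / len(rewards)
    # All-correct or all-wrong probes yield zero group-relative advantage:
    # skip the remaining samples for this prompt to save compute.
    if pass_rate in (0.0, 1.0):
        return rewards
    # Informative prompt: spend the rest of the rollout budget on it.
    rewards += generate_and_score(prompt, full_n - probe_n)
    return rewards
```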