CodeV-R1: Reasoning-Enhanced Verilog Generation

May 30, 2025
Authors: Yaoyu Zhu, Di Huang, Hanqi Lyu, Xiaoyun Zhang, Chongxiao Li, Wenxuan Shi, Yutong Wu, Jianan Mu, Jinghua Wang, Yang Zhao, Pengwei Jin, Shuyao Cheng, Shengwen Liang, Xishan Zhang, Rui Zhang, Zidong Du, Qi Guo, Xing Hu, Yunji Chen
cs.AI

Abstract

Large language models (LLMs) trained via reinforcement learning with verifiable reward (RLVR) have achieved breakthroughs on tasks with explicit, automatable verification, such as software programming and mathematical problems. Extending RLVR to electronic design automation (EDA), especially automatically generating hardware description languages (HDLs) like Verilog from natural-language (NL) specifications, however, poses three key challenges: the lack of automated and accurate verification environments, the scarcity of high-quality NL-code pairs, and the prohibitive computational cost of RLVR. To this end, we introduce CodeV-R1, an RLVR framework for training Verilog generation LLMs. First, we develop a rule-based testbench generator that performs robust equivalence checking against golden references. Second, we propose a round-trip data synthesis method that pairs open-source Verilog snippets with LLM-generated NL descriptions, verifies code-NL-code consistency via the generated testbench, and filters out inequivalent examples to yield a high-quality dataset. Third, we employ a two-stage "distill-then-RL" training pipeline: distillation for the cold start of reasoning abilities, followed by adaptive DAPO, our novel RLVR algorithm that can reduce training cost by adaptively adjusting the sampling rate. The resulting model, CodeV-R1-7B, achieves 68.6% and 72.9% pass@1 on VerilogEval v2 and RTLLM v1.1, respectively, surpassing the prior state-of-the-art by 12~20%, while matching or even exceeding the performance of the 671B DeepSeek-R1. We will release our model, training pipeline, and dataset to facilitate research in the EDA and LLM communities.
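
To make the round-trip data synthesis step concrete, here is a minimal Python sketch of the filtering loop the abstract describes. It is an illustration only, not the released CodeV-R1 pipeline: the callables `describe_in_nl`, `generate_verilog`, and `is_equivalent` are hypothetical stand-ins for the paper's LLM prompts and its rule-based, testbench-driven equivalence check.

```python
from typing import Callable, Dict, List


def round_trip_filter(
    verilog_snippets: List[str],
    describe_in_nl: Callable[[str], str],       # LLM: Verilog snippet -> NL spec
    generate_verilog: Callable[[str], str],     # LLM: NL spec -> Verilog
    is_equivalent: Callable[[str, str], bool],  # testbench-driven equivalence check
) -> List[Dict[str, str]]:
    """Keep only NL-code pairs whose round trip preserves behaviour."""
    dataset: List[Dict[str, str]] = []
    for golden in verilog_snippets:
        nl_spec = describe_in_nl(golden)        # code -> NL description
        candidate = generate_verilog(nl_spec)   # NL -> code (the round trip)
        # Both designs are driven with the same stimuli by the generated
        # testbench; the pair is kept only if their outputs always match.
        if is_equivalent(golden, candidate):
            dataset.append({"nl": nl_spec, "code": golden})
    return dataset
```

Note that the kept pair uses the original open-source snippet, not the regenerated candidate, as the code side: the verified snippet remains the golden reference, which the same testbench-based equivalence check can presumably reuse as the verifiable reward signal during the RL stage.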