CodeV-R1: Reasoning-Enhanced Verilog Generation

Abstract

Large language models (LLMs) trained via reinforcement learning with verifiable reward (RLVR) have achieved breakthroughs on tasks with explicit, automatable verification, such as software programming and mathematical problems. Extending RLVR to electronic design automation (EDA), and in particular to automatically generating hardware description languages (HDLs) such as Verilog from natural-language (NL) specifications, poses three key challenges: the lack of automated and accurate verification environments, the scarcity of high-quality NL-code pairs, and the prohibitive computational cost of RLVR. To address these, we introduce CodeV-R1, an RLVR framework for training Verilog generation LLMs. First, we develop a rule-based testbench generator that performs robust equivalence checking against golden references. Second, we propose a round-trip data synthesis method that pairs open-source Verilog snippets with LLM-generated NL descriptions, verifies code-NL-code consistency via the generated testbench, and filters out inequivalent examples to yield a high-quality dataset. Third, we employ a two-stage "distill-then-RL" training pipeline: distillation for the cold start of reasoning abilities, followed by adaptive DAPO, our novel RLVR algorithm that reduces training cost by adaptively adjusting the sampling rate. The resulting model, CodeV-R1-7B, achieves 68.6% and 72.9% pass@1 on VerilogEval v2 and RTLLM v1.1, respectively, surpassing the prior state of the art by 12-20% while matching or even exceeding the performance of the 671B-parameter DeepSeek-R1. We will release our model, training pipeline, and dataset to facilitate research in the EDA and LLM communities.
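The round-trip filtering idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes each Verilog module can be modeled as a Python function from input values to output values, and it replaces the rule-based testbench generator with a simple corner-case-plus-random stimulus check. The names `equivalent`, `round_trip_filter`, `regenerate`, and the toy designs are all hypothetical.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a simulated combinational Verilog module:
# a function from input port values to output port values.
Stimulus = Dict[str, int]
Design = Callable[[Stimulus], Dict[str, int]]


def equivalent(golden: Design, candidate: Design, widths: Dict[str, int],
               trials: int = 256, seed: int = 0) -> bool:
    """Compare two designs on corner-case and random stimuli.

    This is simulation-based checking, not a formal proof: it can only
    show a mismatch, never guarantee equivalence.
    """
    rng = random.Random(seed)
    corners = [
        {name: 0 for name in widths},                        # all zeros
        {name: (1 << w) - 1 for name, w in widths.items()},  # all ones
    ]
    randoms = [{name: rng.getrandbits(w) for name, w in widths.items()}
               for _ in range(trials)]
    return all(golden(s) == candidate(s) for s in corners + randoms)


def round_trip_filter(pairs: List[Tuple[str, Design]],
                      regenerate: Callable[[str], Design],
                      widths: Dict[str, int]) -> List[Tuple[str, Design]]:
    """Keep (description, golden) pairs whose regenerated design matches."""
    return [(nl, golden) for nl, golden in pairs
            if equivalent(golden, regenerate(nl), widths)]


# Toy designs (assumed for illustration only).
def adder(s: Stimulus) -> Dict[str, int]:
    return {"sum": (s["a"] + s["b"]) & 0xF}


def adder_mod(s: Stimulus) -> Dict[str, int]:
    return {"sum": (s["a"] + s["b"]) % 16}


def or_gate(s: Stimulus) -> Dict[str, int]:
    return {"sum": (s["a"] | s["b"]) & 0xF}


WIDTHS = {"a": 4, "b": 4}
pairs = [("4-bit adder, wraps on overflow", adder),
         ("ambiguous: combine a and b", adder)]
# Stand-in for "LLM regenerates code from the NL description".
regenerated = {"4-bit adder, wraps on overflow": adder_mod,
               "ambiguous: combine a and b": or_gate}

kept = round_trip_filter(pairs, regenerated.__getitem__, WIDTHS)
print([nl for nl, _ in kept])
```

In this toy run, the precise description survives because the regenerated design matches the golden reference on every stimulus, while the ambiguous description is dropped because its regenerated design diverges on the all-ones input. A real pipeline would simulate actual Verilog and would also need to handle sequential logic, which this sketch does not attempt.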
