Rule2DRC: Benchmarking LLM Agents for DRC Script Synthesis with Execution-Guided Test Generation

Abstract

Manufacturable chip layouts must satisfy thousands of geometry-based design rules, and design rule checking (DRC) enforces them by running executable DRC scripts on layouts. Translating natural-language rules into correct DRC scripts is labor-intensive and requires specialized expertise, which motivates the use of LLM agents for DRC script synthesis and debugging. However, existing benchmarks have small evaluation sets and often evaluate scripts by code similarity rather than execution correctness. In addition, prior machine-learning-based methods either ignore execution feedback or require labeled test layouts as input to the agent. To this end, we introduce Rule2DRC, a large-scale benchmark for DRC script coding agents with 1,000 rule-to-script tasks and 13,921 evaluation chip layouts for execution-based scoring. Rule2DRC provides an evaluation pipeline that measures functional correctness via DRC execution outcomes without requiring evaluation layouts as input to the agent. We also propose SplitTester, a tester agent for program selection that uses execution feedback to generate discriminative test cases and separate previously indistinguishable candidate scripts, substantially improving Best-of-N selection performance in this domain. We release the code at https://github.com/snu-mllab/Rule2DRC.
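
To make the idea of a discriminative test case concrete, the following is a minimal toy sketch and not the SplitTester implementation. It assumes a layout can be stood in for by a tuple of wire widths and a candidate script by a Python callable that returns True when it flags a violation. The names `group_by_outcomes`, `is_discriminative`, and `candidates` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical stand-ins (assumptions, not the benchmark's data model):
# a "layout" is a tuple of wire widths, and a "script" is a callable
# that returns True when it reports a violation on that layout.
Layout = Tuple[int, ...]
Script = Callable[[Layout], bool]


def group_by_outcomes(scripts: Dict[str, Script],
                      layouts: Sequence[Layout]) -> List[List[str]]:
    """Group candidate names whose execution outcomes agree on every layout."""
    groups = defaultdict(list)
    for name, script in scripts.items():
        signature = tuple(script(layout) for layout in layouts)
        groups[signature].append(name)
    return sorted(sorted(members) for members in groups.values())


def is_discriminative(scripts: Dict[str, Script],
                      group: Sequence[str],
                      layout: Layout) -> bool:
    """A layout is discriminative for a group if its members disagree on it."""
    return len({scripts[name](layout) for name in group}) > 1


# Three candidates for the toy rule "minimum width is 3".
candidates: Dict[str, Script] = {
    "a": lambda layout: any(w < 3 for w in layout),   # intended behavior
    "b": lambda layout: any(w <= 3 for w in layout),  # off-by-one at the boundary
    "c": lambda layout: any(w < 3 for w in layout),   # behaves the same as "a"
}

tests: List[Layout] = [(1, 5), (4, 5)]
before = group_by_outcomes(candidates, tests)

# A layout that sits exactly on the rule boundary splits the group.
new_test: Layout = (3, 5)
after = group_by_outcomes(candidates, tests + [new_test])
```

In this toy setting, the two initial layouts leave all three candidates in one group, while the boundary layout separates the off-by-one candidate from the other two. How SplitTester actually proposes such layouts from execution feedback is described by the paper and its released code, not by this sketch.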