Rule2DRC: Benchmarking LLM Agents for DRC Script Synthesis with Execution-Guided Test Generation

Abstract

Manufacturable chip layouts must satisfy thousands of geometry-based design rules, and design rule checking (DRC) enforces them by running executable DRC scripts on layouts. Translating natural-language rules into correct DRC scripts is labor-intensive and requires specialized expertise, which motivates the use of LLM agents for DRC script synthesis and debugging. However, existing benchmarks have small evaluation sets and often score scripts by code similarity rather than execution correctness. In addition, prior machine learning-based methods either ignore execution feedback or require labeled test layouts as input to the agent. To address these gaps, we introduce Rule2DRC, a large-scale benchmark for DRC script coding agents with 1,000 rule-to-script tasks and 13,921 evaluation chip layouts for execution-based scoring. Rule2DRC provides an evaluation pipeline that measures functional correctness via DRC execution outcomes without requiring the evaluation layouts to be given to the agent as input. We also propose SplitTester, a tester agent for program selection that uses execution feedback to generate discriminative test cases and separate previously indistinguishable candidate scripts, substantially improving Best-of-N selection performance in this domain. We release the code at https://github.com/snu-mllab/Rule2DRC.
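To make the idea of discriminative test cases concrete, the following is a minimal toy sketch, not the SplitTester implementation. It assumes a candidate script can be modeled as a Python callable that maps a layout to a hashable DRC outcome, and it uses a simple largest-agreeing-group heuristic for Best-of-N selection. The names `partition_by_behavior`, `is_discriminative`, and `select_best_of_n` are hypothetical and do not come from the Rule2DRC repository.

```python
from collections import defaultdict
from typing import Callable, Hashable, List, Sequence

# Assumption: a candidate "script" is modeled as a callable that maps a
# layout to a hashable DRC outcome (here: True if a violation is reported).
Script = Callable[[object], Hashable]


def partition_by_behavior(scripts: Sequence[Script],
                          layouts: Sequence[object]) -> List[List[int]]:
    """Group script indices whose outcomes agree on every test layout."""
    groups = defaultdict(list)
    for idx, script in enumerate(scripts):
        signature = tuple(script(layout) for layout in layouts)
        groups[signature].append(idx)
    return list(groups.values())


def is_discriminative(scripts: Sequence[Script],
                      group: Sequence[int],
                      layout: object) -> bool:
    """A layout is discriminative for a group if its members disagree on it."""
    return len({scripts[i](layout) for i in group}) > 1


def select_best_of_n(scripts: Sequence[Script],
                     layouts: Sequence[object]) -> int:
    """Return one index from the largest agreeing group (a simple heuristic)."""
    groups = partition_by_behavior(scripts, layouts)
    return max(groups, key=len)[0]


# Toy rule: "wire width must be at least 0.5". A layout is just a width.
candidates = [
    lambda w: w < 0.5,   # flags widths strictly below 0.5
    lambda w: w <= 0.5,  # off-by-boundary: also flags exactly 0.5
    lambda w: w < 0.5,   # same behavior as the first candidate
]

easy_tests = [0.3, 0.7]          # all candidates agree on these
boundary_test = 0.5              # separates the second candidate

groups_before = partition_by_behavior(candidates, easy_tests)
splits = is_discriminative(candidates, groups_before[0], boundary_test)
groups_after = partition_by_behavior(candidates, easy_tests + [boundary_test])
chosen = select_best_of_n(candidates, easy_tests + [boundary_test])
```

In this toy setting the two easy layouts leave all three candidates in one group, while the boundary layout splits off the candidate with the boundary error. How a real tester agent proposes such layouts from execution feedback is outside the scope of this sketch.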