QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation
March 25, 2026
Authors: Ali Slim, Haydar Hamieh, Jawad Kotaich, Yehya Ghosn, Mahdi Chehimi, Ammar Mohanna, Hasan Abed Al Kader Hammoud, Bernard Ghanem
cs.AI
Abstract
Large Language Models (LLMs) are increasingly used for code generation, yet quantum code generation is still evaluated mostly within single frameworks, making it difficult to separate quantum reasoning from framework familiarity. We introduce QuanBench+, a unified benchmark spanning Qiskit, PennyLane, and Cirq, with 42 aligned tasks covering quantum algorithms, gate decomposition, and state preparation.
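To make the cross-framework alignment concrete, the sketch below shows how one canonical task, Bell-state preparation, could be expressed in each of the three frameworks. This is an illustrative example in the spirit of the benchmark, not a task taken verbatim from QuanBench+; the variable and function names (`qiskit_circuit`, `pennylane_bell`, etc.) are ours.

```python
# Illustrative only: the same Bell-state preparation task expressed in
# Qiskit, Cirq, and PennyLane. A model must produce semantically
# equivalent circuits despite three rather different APIs.
from qiskit import QuantumCircuit
import cirq
import pennylane as qml

# Qiskit: imperative circuit construction on integer qubit indices.
qiskit_circuit = QuantumCircuit(2)
qiskit_circuit.h(0)
qiskit_circuit.cx(0, 1)
qiskit_circuit.measure_all()

# Cirq: operations applied to explicit qubit objects.
q0, q1 = cirq.LineQubit.range(2)
cirq_circuit = cirq.Circuit(
    cirq.H(q0),
    cirq.CNOT(q0, q1),
    cirq.measure(q0, q1, key="m"),
)

# PennyLane: the circuit is a decorated function bound to a device.
dev = qml.device("default.qubit", wires=2)

@qml.qnode(dev)
def pennylane_bell():
    qml.Hadamard(wires=0)
    qml.CNOT(wires=[0, 1])
    return qml.probs(wires=[0, 1])
```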
We evaluate models with executable functional tests, report Pass@1 and Pass@5, and use KL-divergence-based acceptance for probabilistic outputs. We additionally study Pass@1 after feedback-based repair, where a model may revise its code after a runtime error or an incorrect answer. Across frameworks, the strongest one-shot scores reach 59.5% in Qiskit, 54.8% in Cirq, and 42.9% in PennyLane; with feedback-based repair, the best scores rise to 83.3%, 76.2%, and 66.7%, respectively. These results show clear progress, but also that reliable multi-framework quantum code generation remains unsolved and still depends strongly on framework-specific knowledge.
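For reference, the sketch below shows the standard unbiased Pass@k estimator together with one plausible form of a KL-divergence acceptance test for probabilistic outputs. The estimator follows the usual definition from the code-generation evaluation literature; the divergence direction, the smoothing `eps`, and the `threshold` tolerance are our assumptions, since the abstract does not specify the exact criterion QuanBench+ uses.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: the probability that at least one of
    k samples drawn from n generations (c of which are correct) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

def kl_divergence(p, q, eps: float = 1e-12) -> float:
    """KL(p || q) between two measurement-outcome distributions,
    smoothed with eps to avoid log(0) on unobserved outcomes."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def accept_probabilistic(observed, reference, threshold: float = 0.05) -> bool:
    """Accept a candidate circuit's sampled output distribution if it is
    close to the reference distribution. The 0.05 tolerance is a
    placeholder, not a value stated by the benchmark."""
    return kl_divergence(observed, reference) <= threshold

# Example: 5 samples per task, 2 of them correct -> estimated Pass@1.
print(round(pass_at_k(n=5, c=2, k=1), 3))  # 0.4
```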