

QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation

March 25, 2026
Authors: Ali Slim, Haydar Hamieh, Jawad Kotaich, Yehya Ghosn, Mahdi Chehimi, Ammar Mohanna, Hasan Abed Al Kader Hammoud, Bernard Ghanem
cs.AI

Abstract

Large Language Models (LLMs) are increasingly used for code generation, yet quantum code generation is still evaluated mostly within single frameworks, making it difficult to separate quantum reasoning from framework familiarity. We introduce QuanBench+, a unified benchmark spanning Qiskit, PennyLane, and Cirq, with 42 aligned tasks covering quantum algorithms, gate decomposition, and state preparation. We evaluate models with executable functional tests, report Pass@1 and Pass@5, and use KL-divergence-based acceptance for probabilistic outputs. We additionally study Pass@1 after feedback-based repair, where a model may revise code after a runtime error or wrong answer. Across frameworks, the strongest one-shot scores reach 59.5% in Qiskit, 54.8% in Cirq, and 42.9% in PennyLane; with feedback-based repair, the best scores rise to 83.3%, 76.2%, and 66.7%, respectively. These results show clear progress, but also that reliable multi-framework quantum code generation remains unsolved and still depends strongly on framework-specific knowledge.
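"Aligned tasks" means the same quantum computation must be expressed in each framework's distinct API, which is what lets the benchmark separate quantum reasoning from framework familiarity. As an illustration only (not an actual QuanBench+ task), a minimal Bell-state preparation, a state-preparation-style problem, looks quite different across the three frameworks:

```python
# Qiskit: imperative circuit object
from qiskit import QuantumCircuit
qc = QuantumCircuit(2)
qc.h(0)
qc.cx(0, 1)

# Cirq: explicit qubit objects, circuit built from operations
import cirq
q0, q1 = cirq.LineQubit.range(2)
circuit = cirq.Circuit([cirq.H(q0), cirq.CNOT(q0, q1)])

# PennyLane: circuit as a decorated function bound to a device
import pennylane as qml
dev = qml.device("default.qubit", wires=2)

@qml.qnode(dev)
def bell():
    qml.Hadamard(wires=0)
    qml.CNOT(wires=[0, 1])
    return qml.probs(wires=[0, 1])
```

The same two-gate computation requires three different idioms, which is the framework-specific knowledge the benchmark isolates.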
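Pass@k is conventionally computed with the unbiased estimator of Chen et al. (2021): draw n samples per task, count the c that pass, and estimate pass@k = 1 - C(n-c, k) / C(n, k). The abstract does not reproduce the formula, so the following is a standard sketch rather than the paper's exact code:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: n generated samples, c of which
    pass the functional tests. Computed as a stable product rather
    than with explicit binomial coefficients."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one pass
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```

For example, pass_at_k(5, 1, 5) == 1.0 (one passing sample out of five guarantees Pass@5), while pass_at_k(5, 1, 1) == 0.2.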
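For probabilistic outputs, where a correct program yields a measurement distribution rather than a single value, acceptance is based on KL divergence against a reference. A minimal sketch of such a criterion, assuming a known reference distribution and an arbitrary threshold (the abstract does not state the actual cutoff used):

```python
import numpy as np

def kl_accept(p_ref, q_obs, threshold=0.1, eps=1e-12):
    """Accept a candidate's measured distribution q_obs if
    D_KL(p_ref || q_obs) falls below a threshold. Both inputs are
    normalized; eps guards against log(0). The threshold value here
    is an illustrative assumption."""
    p = np.asarray(p_ref, dtype=float)
    q = np.asarray(q_obs, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    kl = np.sum(np.where(p > 0, p * np.log((p + eps) / (q + eps)), 0.0))
    return kl < threshold
```

This tolerates sampling noise in shot-based outputs instead of demanding exact equality of distributions.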
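The feedback-based repair setting gives the model a second chance after a runtime error or wrong answer. A minimal sketch of such a loop, where `generate` stands in for a hypothetical LLM call and `run_tests` for the benchmark's executable functional tests (both names are illustrative, not the paper's API):

```python
def pass_at_1_with_repair(generate, run_tests, task_prompt, max_repairs=1):
    """Hypothetical evaluation loop: 'generate' returns code text for a
    prompt; 'run_tests' executes the code and returns (passed, feedback),
    where feedback is a traceback or a wrong-answer description."""
    code = generate(task_prompt)
    passed, feedback = run_tests(code)
    for _ in range(max_repairs):
        if passed:
            break
        # Feed the runtime error / wrong-answer feedback back to the model.
        repair_prompt = (f"{task_prompt}\n\nYour previous attempt failed:\n"
                         f"{feedback}\nReturn corrected code.")
        code = generate(repair_prompt)
        passed, feedback = run_tests(code)
    return passed
```

The reported jump from 59.5% to 83.3% on Qiskit under this setting suggests many one-shot failures are shallow errors that execution feedback is enough to fix.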