QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation

Abstract

Large language models (LLMs) are increasingly used for code generation, yet quantum code generation is still evaluated mostly within a single framework, which makes it difficult to separate quantum reasoning from framework familiarity. We introduce QuanBench+, a unified benchmark spanning Qiskit, PennyLane, and Cirq, with 42 aligned tasks covering quantum algorithms, gate decomposition, and state preparation. We evaluate models with executable functional tests, report Pass@1 and Pass@5, and use KL-divergence-based acceptance for probabilistic outputs. We additionally study Pass@1 after feedback-based repair, in which a model may revise its code after a runtime error or a wrong answer. Across frameworks, the strongest one-shot scores reach 59.5% in Qiskit, 54.8% in Cirq, and 42.9% in PennyLane; with feedback-based repair, the best scores rise to 83.3%, 76.2%, and 66.7%, respectively. These results show clear progress, but they also show that reliable multi-framework quantum code generation remains unsolved and still depends strongly on framework-specific knowledge.
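
The two scoring ideas named in the abstract, Pass@k and KL-divergence-based acceptance, can be illustrated with a minimal sketch. This is not the benchmark's actual harness: the function names, the use of the standard unbiased Pass@k estimator, the direction of the KL divergence (expected relative to observed), and the threshold of 0.05 nats are all assumptions chosen for illustration.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples of which c passed.

    Computes 1 - C(n - c, k) / C(n, k): the probability that at least
    one of k samples drawn without replacement is a passing sample.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k includes a pass.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def counts_to_distribution(counts: dict[str, int]) -> dict[str, float]:
    """Normalize measurement counts (bitstring -> shots) to probabilities."""
    shots = sum(counts.values())
    if shots <= 0:
        raise ValueError("counts must contain at least one shot")
    return {outcome: n / shots for outcome, n in counts.items()}


def kl_divergence(p: dict[str, float], q: dict[str, float],
                  eps: float = 1e-12) -> float:
    """KL(p || q) in nats; q is floored at eps to avoid division by zero."""
    total = 0.0
    for outcome, p_x in p.items():
        if p_x > 0.0:
            q_x = max(q.get(outcome, 0.0), eps)
            total += p_x * math.log(p_x / q_x)
    return total


def accepts(expected: dict[str, float], counts: dict[str, int],
            threshold: float = 0.05) -> bool:
    """Accept a sampled output if KL(expected || observed) <= threshold.

    The threshold value is a hypothetical choice, not taken from the paper.
    """
    observed = counts_to_distribution(counts)
    return kl_divergence(expected, observed) <= threshold


if __name__ == "__main__":
    # Ideal Bell-state distribution versus two hypothetical sampled outputs.
    bell = {"00": 0.5, "11": 0.5}
    print(accepts(bell, {"00": 510, "11": 490}))  # close to the ideal
    print(accepts(bell, {"00": 1000}))            # missing the "11" outcome
    print(pass_at_k(n=5, c=1, k=1))
```

A distribution-level check of this kind tolerates shot noise in sampled measurement results, which an exact-match comparison would reject.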
