QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation

Abstract

Large language models (LLMs) are increasingly used for code generation, yet quantum code generation is still evaluated mostly within a single framework, which makes it difficult to separate quantum reasoning from framework familiarity. We introduce QuanBench+, a unified benchmark spanning Qiskit, PennyLane, and Cirq, with 42 aligned tasks covering quantum algorithms, gate decomposition, and state preparation. We evaluate models with executable functional tests, report Pass@1 and Pass@5, and use KL-divergence-based acceptance for probabilistic outputs. We additionally study Pass@1 after feedback-based repair, where a model may revise its code after a runtime error or a wrong answer. Across frameworks, the strongest one-shot scores reach 59.5% in Qiskit, 54.8% in Cirq, and 42.9% in PennyLane; with feedback-based repair, the best scores rise to 83.3%, 76.2%, and 66.7%, respectively. These results show clear progress, but they also show that reliable multi-framework quantum code generation remains unsolved and still depends strongly on framework-specific knowledge.
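To make the two evaluation ingredients named in the abstract more concrete, the following is a minimal sketch of how a KL-divergence acceptance check and a Pass@k score could be computed. It is not the benchmark's actual harness: the function names, the direction of the divergence (expected relative to observed), the smoothing constant `eps`, and the `threshold` value of 0.05 are all illustrative assumptions, and the Pass@k formula shown is the commonly used unbiased estimator, which the abstract does not confirm is the one applied here.

```python
import math


def counts_to_probs(counts):
    """Convert measurement counts (bitstring -> shots) into probabilities."""
    shots = sum(counts.values())
    return {outcome: n / shots for outcome, n in counts.items()}


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats over discrete outcomes.

    Outcomes missing from q are floored at eps (an assumed smoothing
    choice) so the logarithm stays finite.
    """
    total = 0.0
    for outcome, pk in p.items():
        if pk > 0.0:
            qk = max(q.get(outcome, 0.0), eps)
            total += pk * math.log(pk / qk)
    return total


def accept(expected_probs, observed_counts, threshold=0.05):
    """Accept a probabilistic output if it is close to the reference.

    threshold is a hypothetical value, not one taken from the paper.
    """
    observed = counts_to_probs(observed_counts)
    return kl_divergence(expected_probs, observed) <= threshold


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples, c of which pass.

    Computes 1 - C(n - c, k) / C(n, k).
    """
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


if __name__ == "__main__":
    # Reference distribution for a Bell state measured in the computational basis.
    bell = {"00": 0.5, "11": 0.5}
    print(accept(bell, {"00": 510, "11": 490}))  # sampled counts near the reference
    print(accept(bell, {"00": 1000}))            # all shots on a single outcome
    print(pass_at_k(n=5, c=2, k=1))
```

A distribution-level check of this kind tolerates shot noise in sampled circuits, which an exact-match comparison of outputs would not.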
