GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers

Abstract

The autonomous discovery of bugs remains a significant challenge in modern software development. Compared to code generation, the complexity of dynamic runtime environments makes bug discovery considerably harder for large language models (LLMs). In this paper, we take game development as a representative domain and introduce the Game Benchmark for Quality Assurance (GBQA), a benchmark containing 30 games and 124 human-verified bugs across three difficulty levels, to evaluate whether LLMs can autonomously detect software bugs. The benchmark is constructed using a multi-agent system that develops games and injects bugs in a scalable manner, with human experts in the loop to ensure correctness. Moreover, we provide a baseline interactive agent equipped with a multi-round ReAct loop and a memory mechanism, enabling long-horizon exploration of game environments for bug detection across different LLMs. Extensive experiments on frontier LLMs demonstrate that autonomous bug discovery remains highly challenging: the best-performing model, Claude-4.6-Opus in thinking mode, identifies only 48.39% of the verified bugs. We believe GBQA provides an adequate testbed and evaluation criterion, and that further progress on it will help close the gap in autonomous software engineering.
