GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers
April 3, 2026
Authors: Shufan Jiang, Chios Chen, Zhiyang Chen
cs.AI
Abstract
The autonomous discovery of bugs remains a significant challenge in modern software development. Compared to code generation, the complexity of dynamic runtime environments makes bug discovery considerably harder for large language models (LLMs). In this paper, we take game development as a representative domain and introduce the Game Benchmark for Quality Assurance (GBQA), a benchmark containing 30 games and 124 human-verified bugs across three difficulty levels, to evaluate whether LLMs can autonomously detect software bugs. The benchmark is constructed using a multi-agent system that develops games and injects bugs in a scalable manner, with human experts in the loop to ensure correctness. Moreover, we provide a baseline interactive agent equipped with a multi-round ReAct loop and a memory mechanism, enabling long-horizon exploration of game environments for bug detection across different LLMs. Extensive experiments on frontier LLMs demonstrate that autonomous bug discovery remains highly challenging: the best-performing model, Claude-4.6-Opus in thinking mode, identifies only 48.39% of the verified bugs. We believe GBQA provides an adequate testbed and evaluation criterion, and that further progress on it will help close the gap in autonomous software engineering.
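The abstract describes a baseline agent that runs a multi-round ReAct loop with a memory mechanism for long-horizon exploration. The sketch below illustrates that general pattern only; the environment stub, the `propose_action` policy, and all names are hypothetical placeholders, not the GBQA implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Rolling record of past actions and observations across rounds."""
    entries: list = field(default_factory=list)

    def add(self, round_no, action, observation):
        self.entries.append((round_no, action, observation))

    def summary(self):
        # Condensed history fed back to the policy each round.
        return "; ".join(f"r{r}: {a} -> {o}" for r, a, o in self.entries)

def react_bug_hunt(env_step, propose_action, max_rounds=10):
    """Multi-round ReAct loop: reason over memory, act, observe, repeat.

    env_step(action) -> (observation, bugs_found) simulates the game;
    propose_action(memory_summary) stands in for the LLM policy.
    Both are illustrative stubs, not the paper's actual components.
    """
    memory = Memory()
    found = set()
    for r in range(1, max_rounds + 1):
        action = propose_action(memory.summary())
        observation, bugs = env_step(action)
        found.update(bugs)
        memory.add(r, action, observation)
    return found, memory

# Toy environment: two consecutive "jump" actions reveal a clipping bug.
def make_env():
    history = []
    def step(action):
        history.append(action)
        if history[-2:] == ["jump", "jump"]:
            return "player clipped through floor", {"clip-through-floor"}
        return "nothing unusual", set()
    return step

policy = lambda mem: "jump"  # trivial policy: repeat the same action
bugs, mem = react_bug_hunt(make_env(), policy, max_rounds=3)
print(sorted(bugs))
```

The key design point the abstract highlights is that memory persists across rounds, so behaviors that only surface after a sequence of actions (like the repeated jump above) remain discoverable over a long horizon.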