

GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers

April 3, 2026
Authors: Shufan Jiang, Chios Chen, Zhiyang Chen
cs.AI

Abstract

The autonomous discovery of bugs remains a significant challenge in modern software development. Compared to code generation, the complexity of dynamic runtime environments makes bug discovery considerably harder for large language models (LLMs). In this paper, we take game development as a representative domain and introduce the Game Benchmark for Quality Assurance (GBQA), a benchmark containing 30 games and 124 human-verified bugs across three difficulty levels, to evaluate whether LLMs can autonomously detect software bugs. The benchmark is constructed using a multi-agent system that develops games and injects bugs in a scalable manner, with human experts in the loop to ensure correctness. Moreover, we provide a baseline interactive agent equipped with a multi-round ReAct loop and a memory mechanism, enabling long-horizon exploration of game environments for bug detection across different LLMs. Extensive experiments on frontier LLMs demonstrate that autonomous bug discovery remains highly challenging: the best-performing model, Claude-4.6-Opus in thinking mode, identifies only 48.39% of the verified bugs. We believe GBQA provides a suitable testbed and evaluation criterion, and that further progress on it will help close the gap in autonomous software engineering.
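The baseline agent described above can be pictured as a think-act-observe loop whose trace persists across rounds. The following is a minimal sketch under stated assumptions: `GameEnv`, `ReActAgent`, and the rule-based `think` step are all hypothetical illustrations (in GBQA the thinking step would be an LLM call, and the environment a real game), not the authors' implementation.

```python
# Hypothetical sketch of a multi-round ReAct loop with memory, as described
# in the abstract. All names here are illustrative, not from the paper.
from dataclasses import dataclass, field

@dataclass
class GameEnv:
    """Toy stand-in for a game under test: an injected bug fires past a threshold."""
    state: int = 0
    bug_at: int = 3

    def step(self, action: str) -> str:
        if action == "advance":
            self.state += 1
        if self.state >= self.bug_at:
            # Simulated injected bug surfacing in the observation.
            return f"observation: score went negative at state {self.state}"
        return f"observation: state={self.state}, all nominal"

@dataclass
class ReActAgent:
    memory: list = field(default_factory=list)  # trace persists across rounds

    def think(self, observation: str) -> str:
        # Placeholder for an LLM reasoning step: report if the observation
        # contradicts expected game behavior, otherwise keep exploring.
        return "report_bug" if "negative" in observation else "advance"

    def run(self, env: GameEnv, max_rounds: int = 10) -> dict:
        obs = "observation: game started"
        for _ in range(max_rounds):
            action = self.think(obs)           # reason over latest observation
            self.memory.append((obs, action))  # record reason -> act pair
            if action == "report_bug":
                return {"bug_found": True, "trace": self.memory}
            obs = env.step(action)             # act, then observe
        return {"bug_found": False, "trace": self.memory}

result = ReActAgent().run(GameEnv())
```

The memory list is what enables long-horizon exploration: each round's decision can condition on the full interaction history rather than only the latest observation.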