GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?
June 16, 2026
Authors: Tongxu Luo, Rongsheng Wang, Jiaxi Bi, Chenming Xu, Zhengyang Tang, Jianlong Chen, Juhao Liang, Ke Ji, Shuqi Guo, Yuhao Du, Fan Bu, Wenyu Du, Xiaotong Zhang, Kyle Li, Shaobo Wang, Linfeng Zhang, Yuxuan Liu, Xin Lai, Chenxin Li, Yiduo Guo, Zhexin Zhang, Xinyuan Wang, Tianyi Bai, Ziniu Li, Benyou Wang
cs.AI
Abstract
Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime interactions must jointly produce coherent gameplay. We formalize end-to-end game generation as the problem of producing a complete game artifact that realizes a specification through observable player-game interaction in a target environment. We argue that evaluating this setting requires three desiderata: Engine Grounding, Artifact Completeness, and Interactive Verification. We propose an interaction-grounded evaluation framework that assesses executable gameplay through replayed demonstrations and rubric-guided multimodal judging. We instantiate this framework as GameCraft-Bench, a benchmark comprising 140 Godot tasks across 15 game families. Evaluations of frontier coding agents show that end-to-end game generation remains highly challenging: the strongest agent achieves only 41.46%, and most agents score below 40%. Further analysis reveals that while agents often implement recognizable mechanics, they struggle to deliver complete games with sufficient content, functional visual feedback, and coherent presentation. See https://tongxuluo.github.io/gamecraft-bench-website for demos, code, and data.