
GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?

June 16, 2026
Authors: Tongxu Luo, Rongsheng Wang, Jiaxi Bi, Chenming Xu, Zhengyang Tang, Jianlong Chen, Juhao Liang, Ke Ji, Shuqi Guo, Yuhao Du, Fan Bu, Wenyu Du, Xiaotong Zhang, Kyle Li, Shaobo Wang, Linfeng Zhang, Yuxuan Liu, Xin Lai, Chenxin Li, Yiduo Guo, Zhexin Zhang, Xinyuan Wang, Tianyi Bai, Ziniu Li, Benyou Wang
cs.AI

Abstract

Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime interactions must jointly produce coherent gameplay. We formalize end-to-end game generation as the problem of producing a complete game artifact that realizes a specification through observable player-game interaction in a target environment. We argue that evaluating this setting requires three desiderata: Engine Grounding, Artifact Completeness, and Interactive Verification. We propose an interaction-grounded evaluation framework that assesses executable gameplay through replayed demonstrations and rubric-guided multimodal judging. We instantiate this framework as GameCraft-Bench, a benchmark comprising 140 Godot tasks across 15 game families. Evaluations of frontier coding agents show that end-to-end game generation remains highly challenging: the strongest agent achieves only 41.46%, and most agents score below 40%. Further analysis reveals that while agents often implement recognizable mechanics, they struggle to deliver complete games with sufficient content, functional visual feedback, and coherent presentation. See https://tongxuluo.github.io/gamecraft-bench-website for demos, code, and data.
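
To make the idea of rubric-guided scoring concrete, the sketch below shows one way per-task rubric verdicts could be rolled up into a single benchmark percentage. It is an illustration only: the abstract does not describe the actual aggregation used by GameCraft-Bench, so the `RubricItem` type, the pass/fail verdicts, the unweighted averaging, and the task names are all assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RubricItem:
    """One rubric check (hypothetical structure, not the benchmark's schema)."""
    description: str  # what a judge would look for in the replayed gameplay
    passed: bool      # assumed binary verdict for this check


def task_score(items):
    """Fraction of rubric items satisfied for one task (0.0 if there are none)."""
    if not items:
        return 0.0
    return sum(item.passed for item in items) / len(items)


def benchmark_score(tasks):
    """Unweighted mean of per-task scores, as a percentage.

    Assumes every task counts equally; raises StatisticsError if `tasks` is empty.
    """
    return 100.0 * mean(task_score(items) for items in tasks.values())


# Hypothetical task names and verdicts, for illustration only.
tasks = {
    "platformer_01": [
        RubricItem("player can jump between platforms", True),
        RubricItem("score display updates on pickup", False),
    ],
    "puzzle_07": [
        RubricItem("tiles swap when selected", True),
    ],
}

print(f"{benchmark_score(tasks):.2f}%")
```

Under this assumed scheme, a task that implements its mechanics but lacks working visual feedback loses credit only on the rubric items that check that feedback, which is one way partial progress could show up in an overall score.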