GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?

Abstract

Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime interactions must jointly produce coherent gameplay. We formalize end-to-end game generation as the problem of producing a complete game artifact that realizes a specification through observable player-game interaction in a target environment. We argue that evaluating this setting requires three desiderata: Engine Grounding, Artifact Completeness, and Interactive Verification. We propose an interaction-grounded evaluation framework that assesses executable gameplay through replayed demonstrations and rubric-guided multimodal judging. We instantiate this framework as GameCraft-Bench, a benchmark comprising 140 Godot tasks across 15 game families. Evaluations of frontier coding agents show that end-to-end game generation remains highly challenging: the strongest agent achieves a score of only 41.46%, and most agents score below 40%. Further analysis reveals that while agents often implement recognizable mechanics, they struggle to deliver complete games with sufficient content, functional visual feedback, and coherent presentation. Demos, code, and data are available at https://tongxuluo.github.io/gamecraft-bench-website.