GameCraft-Bench: Can Agents Build Playable Games End-to-End in Real Game Engines?

Abstract

Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime interactions must jointly produce coherent gameplay. We formalize end-to-end game generation as the problem of producing a complete game artifact that realizes a specification through observable player-game interaction in a target environment. We argue that evaluating this setting requires three desiderata: Engine Grounding, Artifact Completeness, and Interactive Verification. We propose an interaction-grounded evaluation framework that assesses executable gameplay through replayed demonstrations and rubric-guided multimodal judging. We instantiate this framework as GameCraft-Bench, a benchmark comprising 140 Godot tasks across 15 game families. Evaluations of frontier coding agents show that end-to-end game generation remains highly challenging: the strongest agent achieves only 41.46%, and most agents score below 40%. Further analysis reveals that while agents often implement recognizable mechanics, they struggle to deliver complete games with sufficient content, functional visual feedback, and coherent presentation. See https://tongxuluo.github.io/gamecraft-bench-website for demos, code, and data.