JAMER: A Project-Level Code Framework Dataset and Benchmark for Professional Game Engines

Abstract

AI-driven game development has made substantial progress in asset generation, gameplay design, and web-based game coding, yet project-level code engineering on professional game engines remains largely unexplored because large-scale datasets and deterministic evaluation methods are lacking. We present JamSet and JamBench, the first project-level game code framework dataset and benchmark built on a professional game engine. Our key insight is that Game Jams, community events in which developers build complete games under tight time constraints, yield thousands of open-source projects suitable for this purpose. Building on the Godot engine's text-based format and headless execution mode, we design a deterministic verification pipeline that runs from file integrity checks to runtime behavior collection, distilling 8,133 verified projects from over 240,000 repositories. Of these, 300 manually verified projects form JamBench; the rest constitute JamSet. JamBench defines theme-driven generation and code completion tasks, evaluated through a pipeline that combines compilation pass rates, the Structural Completeness Score (SCS), and the Behavioral Alignment Score (BAS). An evaluation of nine frontier models reveals a capability cliff as project scale increases, with runtime pass rates dropping from 80.4% on small projects to 5.7% on large ones (Task2a). Code agents improve compilation rates yet yield no gains in runtime behavioral quality, indicating that the bottleneck lies in architectural design rather than syntactic correctness. Experiments validate JamSet as effective training data. All data and code are publicly available.
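As a rough illustration of the kind of staged check the verification pipeline describes (file integrity first, then a headless run), the sketch below may help. It is not the authors' implementation: the function names, the two-stage split, the returned dictionary layout, and the timeout are all assumptions made for this example. It relies only on the Godot command-line options `--headless`, `--path`, and `--quit`, and assumes a Godot 4 binary is available under the name passed as `godot_bin`.

```python
import subprocess
from pathlib import Path


def has_project_file(project_dir):
    """File-integrity stage (assumed): a Godot project must contain project.godot."""
    return (Path(project_dir) / "project.godot").is_file()


def headless_command(project_dir, godot_bin="godot"):
    """Build a command that opens the project without a display and then quits."""
    return [godot_bin, "--headless", "--path", str(project_dir), "--quit"]


def verify_project(project_dir, godot_bin="godot", timeout_s=60):
    """Run the two hypothetical stages in order and report the first failure."""
    if not has_project_file(project_dir):
        return {"stage": "file_integrity", "passed": False}
    try:
        result = subprocess.run(
            headless_command(project_dir, godot_bin),
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
        # Missing engine binary or a hung run both count as a runtime failure here.
        return {"stage": "runtime", "passed": False, "error": type(exc).__name__}
    return {
        "stage": "runtime",
        "passed": result.returncode == 0,
        "log": result.stdout + result.stderr,
    }
```

A call such as `verify_project("path/to/project")` returns a small dictionary naming the stage reached and whether it passed. A fuller pipeline would presumably parse the captured log for script errors and collect runtime behavior, which this sketch leaves out.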