JAMER: A Project-Level Code Framework Dataset and Benchmark for Professional Game Engines

Abstract

Current AI-driven game development has made substantial progress in asset generation, gameplay design, and web-based game coding, yet project-level code engineering on professional game engines remains largely unexplored due to the absence of large-scale datasets and deterministic evaluation methods. We present JamSet and JamBench, the first project-level game code framework dataset and benchmark built on a professional game engine. Our key insight is that Game Jam competitions, community events where developers build complete games under tight time constraints, yield thousands of open-source projects suitable for this purpose. Building on the Godot engine's text-based format and headless execution mode, we design a deterministic verification pipeline from file integrity to runtime behavior collection, distilling 8,133 verified projects from over 240,000 repositories. Of these, 300 manually verified projects form JamBench; the rest constitute JamSet. JamBench defines theme-driven generation and code completion tasks, evaluated through a pipeline combining compilation pass rates, Structural Completeness Score (SCS), and Behavioral Alignment Score (BAS). Evaluation of 9 frontier models reveals a capability cliff as project scale increases, with runtime pass rates dropping from 80.4% on small projects to 5.7% on large ones (Task2a). Code Agents improve compilation rates yet yield no gains in runtime behavioral quality, indicating that the bottleneck lies in architectural design rather than syntactic correctness. Experiments validate JamSet as effective training data. All data and code are publicly available.
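
To make the "file integrity" end of such a verification pipeline concrete, the following is a minimal sketch and not the pipeline used in this work. It assumes that a project is intact when `project.godot` exists and every quoted `res://` path in the project's text files resolves on disk. The function names `missing_resources` and `headless_command` are hypothetical, and the name of the Godot binary is an assumption about the local installation.

```python
import re
from pathlib import Path

# Quoted res:// paths as they appear in Godot's text-based files
# (project.godot, .tscn scenes, .tres resources).
RES_PATH = re.compile(r'"res://([^"]+)"')


def missing_resources(project_dir):
    """Return the sorted res:// paths that are referenced but absent on disk.

    Hypothetical helper: a simplified integrity check that ignores uid://
    references and sub-resource suffixes.
    """
    root = Path(project_dir)
    if not (root / "project.godot").is_file():
        raise FileNotFoundError("project.godot not found")
    missing = set()
    for pattern in ("project.godot", "**/*.tscn", "**/*.tres"):
        for path in root.glob(pattern):
            text = path.read_text(encoding="utf-8", errors="replace")
            for rel in RES_PATH.findall(text):
                if not (root / rel).exists():
                    missing.add("res://" + rel)
    return sorted(missing)


def headless_command(project_dir, godot_bin="godot"):
    """Build a command that opens the project without a display, then quits.

    The binary name is an assumption; --headless, --path, and --quit are
    Godot 4 command-line options.
    """
    return [godot_bin, "--headless", "--path", str(project_dir), "--quit"]
```

A project for which `missing_resources` returns an empty list could then be handed to `subprocess.run(headless_command(project_dir), capture_output=True, timeout=60)` to collect engine output; the timeout value here is illustrative only.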