V-GameGym: Visual Game Generation for Code Large Language Models
September 24, 2025
Authors: Wei Zhang, Jack Yang, Renshuai Tao, Lingzheng Chai, Shawn Guo, Jiajun Wu, Xiaoming Chen, Ganqu Cui, Ning Ding, Xander Xu, Hu Wei, Bowen Zhou
cs.AI
Abstract
Code large language models have demonstrated remarkable capabilities in
programming tasks, yet current benchmarks primarily focus on a single modality
rather than visual game development. Most existing code-related benchmarks
evaluate syntax correctness and execution accuracy, overlooking critical
game-specific metrics such as playability, visual aesthetics, and user
engagement that are essential for real-world deployment. To address the gap
between current LLM capabilities in algorithmic problem-solving and competitive
programming and the comprehensive requirements of practical game
development, we present V-GameGym, a comprehensive benchmark comprising 2,219
high-quality samples across 100 thematic clusters derived from real-world
repositories, adopting a novel clustering-based curation methodology to ensure
both diversity and structural completeness. Further, we introduce a multimodal
evaluation framework with an automated LLM-driven pipeline for visual code
synthesis using complete UI sandbox environments. Our extensive analysis
reveals that V-GameGym effectively bridges the gap between code generation
accuracy and practical game development workflows, providing quantifiable
quality metrics for visual programming and interactive element generation.