V-GameGym: A Visual Game Generation Platform for Code Large Language Models

Abstract

Code large language models have demonstrated remarkable capabilities in programming tasks, yet current benchmarks focus primarily on a single modality and do not cover visual game development. Most existing code benchmarks evaluate syntax correctness and execution accuracy, overlooking game-specific metrics such as playability, visual aesthetics, and user engagement that are essential for real-world deployment. To close the gap between what LLMs can currently do in algorithmic problem solving and competitive programming and what practical game development requires, we present V-GameGym, a comprehensive benchmark of 2,219 high-quality samples spanning 100 thematic clusters and derived from real-world repositories. The samples are selected with a novel clustering-based curation methodology that ensures both diversity and structural completeness. We further introduce a multimodal evaluation framework with an automated LLM-driven pipeline for visual code synthesis in complete UI sandbox environments. Our extensive analysis shows that V-GameGym effectively bridges the gap between code generation accuracy and practical game development workflows, providing quantifiable quality metrics for visual programming and interactive element generation.