V-GameGym: Visual Game Generation for Code Large Language Models
September 24, 2025
Authors: Wei Zhang, Jack Yang, Renshuai Tao, Lingzheng Chai, Shawn Guo, Jiajun Wu, Xiaoming Chen, Ganqu Cui, Ning Ding, Xander Xu, Hu Wei, Bowen Zhou
cs.AI
Abstract
Code large language models have demonstrated remarkable capabilities in
programming tasks, yet current benchmarks primarily focus on a single modality
rather than visual game development. Most existing code-related benchmarks
evaluate syntax correctness and execution accuracy, overlooking critical
game-specific metrics such as playability, visual aesthetics, and user
engagement that are essential for real-world deployment. To address the gap
between current LLM capabilities in algorithmic problem-solving and competitive
programming and the comprehensive requirements of practical game
development, we present V-GameGym, a comprehensive benchmark comprising 2,219
high-quality samples across 100 thematic clusters derived from real-world
repositories, adopting a novel clustering-based curation methodology to ensure
both diversity and structural completeness. Further, we introduce a multimodal
evaluation framework with an automated LLM-driven pipeline for visual code
synthesis using complete UI sandbox environments. Our extensive analysis
reveals that V-GameGym effectively bridges the gap between code generation
accuracy and practical game development workflows, providing quantifiable
quality metrics for visual programming and interactive element generation.