V-GameGym: Visual Game Generation for Code Large Language Models

Abstract

Code large language models have demonstrated remarkable capabilities in programming tasks, yet current benchmarks primarily focus on a single modality rather than visual game development. Most existing code-related benchmarks evaluate syntax correctness and execution accuracy, overlooking game-specific metrics such as playability, visual aesthetics, and user engagement that are essential for real-world deployment. To address the gap between the current strengths of LLMs in algorithmic problem-solving and competitive programming and the broader requirements of practical game development, we present V-GameGym, a comprehensive benchmark comprising 2,219 high-quality samples across 100 thematic clusters derived from real-world repositories, curated with a novel clustering-based methodology to ensure both diversity and structural completeness. We further introduce a multimodal evaluation framework with an automated LLM-driven pipeline for visual code synthesis using complete UI sandbox environments. Our extensive analysis shows that V-GameGym effectively bridges the gap between code generation accuracy and practical game development workflows, providing quantifiable quality metrics for visual programming and interactive element generation.