V-GameGym: Visual Game Generation for Code Large Language Models

Abstract

Code large language models have demonstrated remarkable capabilities in programming tasks, yet current benchmarks focus primarily on a single modality and do not cover visual game development. Most existing code-related benchmarks evaluate only syntax correctness and execution accuracy, overlooking game-specific metrics such as playability, visual aesthetics, and user engagement that are essential for real-world deployment. To address the gap between current LLM capabilities in algorithmic problem-solving and competitive programming and the comprehensive requirements of practical game development, we present V-GameGym, a benchmark comprising 2,219 high-quality samples across 100 thematic clusters derived from real-world repositories. It adopts a novel clustering-based curation methodology to ensure both diversity and structural completeness. Further, we introduce a multimodal evaluation framework with an automated LLM-driven pipeline for visual code synthesis using complete UI sandbox environments. Our extensive analysis reveals that V-GameGym effectively bridges the gap between code generation accuracy and practical game development workflows, providing quantifiable quality metrics for visual programming and interactive element generation.
