PlayCoder: Making LLM-Generated GUI Code Playable

Abstract

Large language models (LLMs) have achieved strong results in code generation, but their ability to generate GUI applications, especially games, remains understudied. Existing benchmarks mainly evaluate correctness through test cases, which are inadequate for GUI applications: these systems are interactive and event-driven, and they require correct state transitions across sequences of user actions. Evaluation should therefore consider interaction flows and UI logic rather than only pass/fail outcomes. To study this problem, we introduce PlayEval, a repository-aware benchmark built from 43 multilingual GUI applications written in Python, TypeScript, and JavaScript. Unlike prior GUI benchmarks, which are difficult to adapt to desktop environments, PlayEval covers six major GUI application categories and directly supports code-generation evaluation. We further propose Play@k, a metric that measures whether at least one of *k* generated candidates can be played end-to-end without logical errors. To support reliable evaluation, we develop PlayTester, an LLM-based agent that performs task-oriented GUI playthroughs and automatically detects logic violations. Experiments on 10 state-of-the-art code LLMs show that, despite high compilation rates, they achieve near-zero Play@3, revealing major weaknesses in generating logically correct GUI applications. To address this limitation, we present PlayCoder, a multi-agent, repository-aware framework that generates, evaluates, and iteratively repairs GUI application code in a closed loop. PlayCoder substantially improves both functional correctness and semantic alignment for open-source and closed-source models, reaching up to 38.1% Exec@3 and 20.3% Play@3. Case studies further show that it can uncover silent logic bugs missed by traditional metrics and fix them through targeted edits.
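As an illustration of the Play@k definition, the following minimal Python sketch computes a benchmark-level score from per-candidate playability verdicts. It assumes that the reported score is the fraction of tasks for which at least one of the first *k* candidates was judged playable; the abstract does not state how scores are aggregated, so this aggregation, the function name `play_at_k`, and the example task names and verdicts are all hypothetical.

```python
from typing import Dict, List


def play_at_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks with at least one playable candidate among the first k.

    `results` maps a task name to a list of verdicts, one per generated
    candidate; True means the candidate was played end-to-end with no
    logic violation. This aggregation is an assumption, not the paper's
    stated formula.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not results:
        return 0.0
    # A task counts as solved if any of its first k candidates is playable.
    solved = sum(1 for verdicts in results.values() if any(verdicts[:k]))
    return solved / len(results)


# Hypothetical verdicts for four tasks, three candidates each.
verdicts = {
    "snake": [False, True, False],
    "tetris": [False, False, False],
    "calculator": [True, True, False],
    "minesweeper": [False, False, True],
}

print(f"Play@1 = {play_at_k(verdicts, 1):.2f}")
print(f"Play@3 = {play_at_k(verdicts, 3):.2f}")
```

In this sketch a candidate that compiles and runs but fails its playthrough still counts as a failure, which is how a model can show a high compilation rate alongside a near-zero Play@3.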
