
PlayCoder: Making LLM-Generated GUI Code Playable

April 21, 2026
Authors: Zhiyuan Peng, Wei Tao, Xin Yin, Chenhao Ying, Yuan Luo, Yiwen Guo
cs.AI

Abstract

Large language models (LLMs) have achieved strong results in code generation, but their ability to generate GUI applications, especially games, remains insufficiently studied. Existing benchmarks mainly evaluate correctness through test cases, which are inadequate for GUI applications because these systems are interactive, event-driven, and require correct state transitions across sequences of user actions. Their evaluation therefore should consider interaction flows and UI logic rather than only pass/fail outcomes. To study this problem, we introduce PlayEval, a repository-aware benchmark built from 43 multilingual GUI applications in Python, TypeScript, and JavaScript. Unlike prior GUI benchmarks that are difficult to adapt to desktop environments, PlayEval covers six major GUI application categories and directly supports code-generation evaluation. We further propose Play@k, a metric that measures whether at least one of *k* generated candidates can be played end-to-end without logical errors. To support reliable evaluation, we develop PlayTester, an LLM-based agent that performs task-oriented GUI playthroughs and detects logic violations automatically. Experiments on 10 state-of-the-art code LLMs show that, despite high compilation rates, they achieve near-zero Play@3, revealing major weaknesses in generating logically correct GUI applications. To address this limitation, we present PlayCoder, a multi-agent, repository-aware framework that generates, evaluates, and iteratively repairs GUI application code in a closed loop. PlayCoder substantially improves both functional correctness and semantic alignment for open-source and closed-source models, reaching up to 38.1% Exec@3 and 20.3% Play@3. Case studies further show that it can uncover silent logic bugs missed by traditional metrics and fix them through targeted edits.
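The Play@k metric defined above can be sketched as a simple per-task aggregate: a task counts as solved if any of its k candidates survives an end-to-end playthrough without logical errors. The function below is an illustrative assumption based on the abstract's description, not the paper's released implementation; the boolean playability results would come from a judge such as PlayTester.

```python
# Illustrative sketch (an assumption, not the paper's official code) of the
# Play@k metric: the fraction of tasks for which at least one of the first
# k generated candidates is playable end-to-end without logical errors.

def play_at_k(results: list[list[bool]], k: int) -> float:
    """results[i][j] is True if candidate j for task i passed the
    playthrough check. Every task must provide at least k candidates."""
    assert all(len(candidates) >= k for candidates in results)
    solved = sum(any(candidates[:k]) for candidates in results)
    return solved / len(results)

# Example: 3 tasks, 3 candidates each.
outcomes = [
    [False, True, False],   # solved by the second candidate
    [False, False, False],  # no playable candidate
    [True, False, False],   # solved by the first candidate
]
print(play_at_k(outcomes, k=3))  # 2 of 3 tasks playable -> 0.666...
```

The same table of outcomes also yields Play@1 by truncating each task to its first candidate, which is how a single-sample score would be read off the same evaluation run.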