PlayCoder: Making LLM-Generated GUI Code Playable
April 21, 2026
Authors: Zhiyuan Peng, Wei Tao, Xin Yin, Chenhao Ying, Yuan Luo, Yiwen Guo
cs.AI
Abstract
Large language models (LLMs) have achieved strong results in code generation, but their ability to generate graphical user interface (GUI) applications, especially games, remains insufficiently studied. Existing benchmarks mainly evaluate correctness through test cases, which are inadequate for GUI applications because these systems are interactive, event-driven, and require correct state transitions across sequences of user actions. Evaluation should therefore consider interaction flows and UI logic rather than only pass/fail outcomes. To study this problem, we introduce PlayEval, a repository-aware benchmark built from 43 multilingual GUI applications in Python, TypeScript, and JavaScript. Unlike prior GUI benchmarks that are difficult to adapt to desktop environments, PlayEval covers six major GUI application categories and directly supports code-generation evaluation. We further propose Play@k, a metric that measures whether at least one of *k* generated candidates can be played end-to-end without logical errors. To support reliable evaluation, we develop PlayTester, an LLM-based agent that performs task-oriented GUI playthroughs and detects logic violations automatically. Experiments on 10 state-of-the-art code LLMs show that, despite high compilation rates, they achieve near-zero Play@3, revealing major weaknesses in generating logically correct GUI applications. To address this limitation, we present PlayCoder, a multi-agent, repository-aware framework that generates, evaluates, and iteratively repairs GUI application code in a closed loop. PlayCoder substantially improves both functional correctness and semantic alignment for open-source and closed-source models, reaching up to 38.1% Exec@3 and 20.3% Play@3. Case studies further show that it can uncover silent logic bugs missed by traditional metrics and fix them through targeted edits.
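The Play@k metric described above parallels the familiar pass@k metric for code generation. The abstract does not give its exact estimator, but assuming it follows the standard unbiased pass@k formulation (where, out of *n* sampled candidates, *c* are playable end-to-end without logical errors), a minimal sketch might look like this; the function name and signature here are illustrative, not from the paper:

```python
from math import comb

def play_at_k(n: int, c: int, k: int) -> float:
    """Hypothetical unbiased estimate of Play@k: the probability that at
    least one of k candidates, drawn from n generations of which c are
    playable end-to-end without logical errors, is playable.
    Mirrors the standard pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k unplayable candidates: every k-subset contains
        # at least one playable candidate.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 generations, 2 playable, k = 3
print(round(play_at_k(10, 2, 3), 4))  # → 0.5333
```

The key design point the metric captures is that a candidate only "passes" if an entire interaction flow completes without logic violations, which is a much stricter bar than compiling or passing unit tests — hence the near-zero Play@3 scores reported despite high compilation rates.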