PlayCoder: Making LLM-Generated GUI Code Playable

Abstract

Large language models (LLMs) have achieved strong results in code generation, but their ability to generate GUI applications, especially games, remains insufficiently studied. Existing benchmarks mainly evaluate correctness through test cases, which are inadequate for GUI applications because these systems are interactive, event-driven, and require correct state transitions across sequences of user actions. Evaluation should therefore consider interaction flows and UI logic rather than only pass/fail outcomes. To study this problem, we introduce PlayEval, a repository-aware benchmark built from 43 multilingual GUI applications in Python, TypeScript, and JavaScript. Unlike prior GUI benchmarks that are difficult to adapt to desktop environments, PlayEval covers six major GUI application categories and directly supports code-generation evaluation. We further propose Play@k, a metric that measures whether at least one of *k* generated candidates can be played end-to-end without logical errors. To support reliable evaluation, we develop PlayTester, an LLM-based agent that performs task-oriented GUI playthroughs and detects logic violations automatically. Experiments on 10 state-of-the-art code LLMs show that, despite high compilation rates, they achieve near-zero Play@3, revealing major weaknesses in generating logically correct GUI applications. To address this limitation, we present PlayCoder, a multi-agent, repository-aware framework that generates, evaluates, and iteratively repairs GUI application code in a closed loop. PlayCoder substantially improves both functional correctness and semantic alignment for open-source and closed-source models, reaching up to 38.1% Exec@3 and 20.3% Play@3. Case studies further show that it can uncover silent logic bugs missed by traditional metrics and fix them through targeted edits.
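As a rough illustration of the Play@k definition above, the following minimal sketch computes the fraction of tasks for which at least one of the first *k* candidates was judged playable. The abstract does not give the exact estimator, so the simple "any of the first k" form, the function name `play_at_k`, and the boolean outcome lists are all assumptions made for illustration; in the paper the per-candidate verdicts would come from PlayTester.

```python
from typing import Sequence


def play_at_k(results: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of tasks where at least one of the first k candidates is playable.

    results[i][j] is True if candidate j for task i was played end-to-end
    without a detected logic error (hypothetical input format).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not results:
        return 0.0
    # A task counts as solved if any of its first k candidates is playable.
    solved = sum(1 for candidates in results if any(candidates[:k]))
    return solved / len(results)


# Hypothetical outcomes for four tasks with three candidates each.
outcomes = [
    [False, False, True],
    [False, False, False],
    [True, False, False],
    [False, False, False],
]

print(play_at_k(outcomes, k=3))  # 0.5
print(play_at_k(outcomes, k=1))  # 0.25
```

With these made-up outcomes, two of four tasks have a playable candidate among the first three, while only one task succeeds on its first candidate.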
