ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents

Abstract

Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. While recent large language models (LLMs) have demonstrated progress in text-to-code generation, many existing approaches rely solely on natural language prompts, which limits their ability to capture spatial layout and visual design intent. In contrast, UI development in practice is inherently multimodal, often starting from visual sketches or mockups. To address this gap, we introduce a modular multi-agent framework that performs UI-to-code generation in three interpretable stages: grounding, planning, and generation. The grounding agent uses a vision-language model (VLM) to detect and label UI components, the planning agent constructs a hierarchical layout using front-end engineering priors, and the generation agent produces HTML/CSS code via adaptive prompt-based synthesis. This design improves robustness, interpretability, and fidelity over end-to-end black-box methods. Furthermore, we extend the framework into a scalable data engine that automatically produces large-scale image-code pairs. Using these synthetic examples, we fine-tune and reinforce an open-source VLM, yielding notable gains in UI understanding and code quality. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness. Our code is publicly available at https://github.com/leigest519/ScreenCoder.
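To make the three-stage structure concrete, the following is a minimal illustrative sketch in Python and not the authors' implementation. Every name in it (`Component`, `LayoutNode`, `ground`, `plan`, `generate`, `screenshot_to_html`) is hypothetical. The grounding stage is replaced by hard-coded detections instead of a real VLM call, the planning stage uses a simple reading-order sort as a stand-in for front-end engineering priors, and the generation stage uses a fixed template rather than prompt-based synthesis.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """A detected UI region (hypothetical type, not from the paper)."""
    label: str
    box: tuple  # (x, y, width, height) in pixels


@dataclass
class LayoutNode:
    """A node in the layout tree produced by the planning stage."""
    label: str
    box: tuple
    children: list = field(default_factory=list)


def ground(screenshot):
    """Stage 1, grounding. ASSUMPTION: a real system would query a VLM here;
    this stub ignores the screenshot and returns fixed detections."""
    return [
        Component("main", (200, 60, 600, 540)),
        Component("header", (0, 0, 800, 60)),
        Component("sidebar", (0, 60, 200, 540)),
    ]


def plan(components, width, height):
    """Stage 2, planning. ASSUMPTION: reading order (top-to-bottom, then
    left-to-right) stands in for richer layout priors."""
    root = LayoutNode("root", (0, 0, width, height))
    for c in sorted(components, key=lambda c: (c.box[1], c.box[0])):
        root.children.append(LayoutNode(c.label, c.box))
    return root


def generate(node, depth=0):
    """Stage 3, generation. ASSUMPTION: a fixed HTML template replaces the
    prompt-based synthesis described in the abstract."""
    x, y, w, h = node.box
    indent = "  " * depth
    style = f"position:absolute;left:{x}px;top:{y}px;width:{w}px;height:{h}px"
    inner = "".join(generate(child, depth + 1) for child in node.children)
    return f'{indent}<div class="{node.label}" style="{style}">\n{inner}{indent}</div>\n'


def screenshot_to_html(screenshot, width, height):
    """Chain the three stages: grounding, then planning, then generation."""
    return generate(plan(ground(screenshot), width, height))


if __name__ == "__main__":
    print(screenshot_to_html(None, 800, 600))
```

Because each stage consumes and returns a plain data structure, the intermediate detections and the layout tree can be inspected or corrected independently, which is the kind of interpretability the modular design is meant to provide.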
