ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents

Abstract

Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. While recent large language models (LLMs) have demonstrated progress in text-to-code generation, many existing approaches rely solely on natural language prompts, which limits their effectiveness in capturing spatial layout and visual design intent. In contrast, UI development in practice is inherently multimodal, often starting from visual sketches or mockups. To address this gap, we introduce a modular multi-agent framework that performs UI-to-code generation in three interpretable stages: grounding, planning, and generation. The grounding agent uses a vision-language model (VLM) to detect and label UI components, the planning agent constructs a hierarchical layout using front-end engineering priors, and the generation agent produces HTML/CSS code via adaptive prompt-based synthesis. This design improves robustness, interpretability, and fidelity over end-to-end black-box methods. Furthermore, we extend the framework into a scalable data engine that automatically produces large-scale image-code pairs. Using these synthetic examples, we fine-tune and reinforce an open-source VLM, yielding notable gains in UI understanding and code quality. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness. Our code is publicly available at https://github.com/leigest519/ScreenCoder.
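The three-stage flow described above (grounding, planning, generation) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the Component and LayoutNode types, the label-to-tag table, and the absolute-positioning output are all assumptions made for illustration. In particular, the grounding step here only filters precomputed detections, where the framework would query a vision-language model, and the generation step uses a fixed template rather than adaptive prompt-based synthesis.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Component:
    """A detected UI region: a label plus a pixel bounding box (hypothetical schema)."""
    label: str
    x: int
    y: int
    w: int
    h: int


@dataclass
class LayoutNode:
    """A node in the layout tree produced by the planning stage (hypothetical type)."""
    tag: str
    component: Optional[Component] = None
    children: List["LayoutNode"] = field(default_factory=list)


# Assumed mapping from region labels to HTML tags; not taken from the paper.
TAGS: Dict[str, str] = {
    "header": "header",
    "sidebar": "aside",
    "navigation": "nav",
    "main": "main",
}


def ground(detections: List[dict]) -> List[Component]:
    """Grounding stand-in: keep only detections with a non-empty box.

    A real grounding agent would obtain these boxes from a VLM.
    """
    return [Component(**d) for d in detections if d["w"] > 0 and d["h"] > 0]


def plan(components: List[Component]) -> LayoutNode:
    """Planning stand-in: one root container, children in reading order."""
    root = LayoutNode(tag="div")
    for c in sorted(components, key=lambda c: (c.y, c.x)):
        root.children.append(LayoutNode(tag=TAGS.get(c.label, "div"), component=c))
    return root


def generate(node: LayoutNode, indent: int = 0) -> str:
    """Generation stand-in: emit nested HTML with inline absolute positioning."""
    pad = "  " * indent
    if node.component is not None:
        c = node.component
        style = (
            f' style="position:absolute;left:{c.x}px;top:{c.y}px;'
            f'width:{c.w}px;height:{c.h}px"'
        )
    else:
        style = ' style="position:relative"'
    if not node.children:
        return f"{pad}<{node.tag}{style}></{node.tag}>"
    inner = "\n".join(generate(child, indent + 1) for child in node.children)
    return f"{pad}<{node.tag}{style}>\n{inner}\n{pad}</{node.tag}>"


if __name__ == "__main__":
    detections = [
        {"label": "sidebar", "x": 0, "y": 60, "w": 200, "h": 540},
        {"label": "header", "x": 0, "y": 0, "w": 800, "h": 60},
        {"label": "main", "x": 200, "y": 60, "w": 600, "h": 540},
    ]
    print(generate(plan(ground(detections))))
```

Keeping each stage as a separate function with an explicit intermediate type mirrors the interpretability argument in the abstract: the detected components and the layout tree can each be inspected or corrected before any code is emitted.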
