ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents

July 30, 2025
Authors: Yilei Jiang, Yaozhi Zheng, Yuxuan Wan, Jiaming Han, Qunzhong Wang, Michael R. Lyu, Xiangyu Yue
cs.AI

Abstract

Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. While recent large language models (LLMs) have demonstrated progress in text-to-code generation, many existing approaches rely solely on natural language prompts, limiting their effectiveness in capturing spatial layout and visual design intent. In contrast, UI development in practice is inherently multimodal, often starting from visual sketches or mockups. To address this gap, we introduce a modular multi-agent framework that performs UI-to-code generation in three interpretable stages: grounding, planning, and generation. The grounding agent uses a vision-language model to detect and label UI components, the planning agent constructs a hierarchical layout using front-end engineering priors, and the generation agent produces HTML/CSS code via adaptive prompt-based synthesis. This design improves robustness, interpretability, and fidelity over end-to-end black-box methods. Furthermore, we extend the framework into a scalable data engine that automatically produces large-scale image-code pairs. Using these synthetic examples, we fine-tune and reinforce an open-source VLM, yielding notable gains in UI understanding and code quality. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness. Our code is made publicly available at https://github.com/leigest519/ScreenCoder.
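
The three stages named in the abstract (grounding, planning, generation) can be pictured with a toy sketch. This is not the authors' implementation: the Component class, the function names, the hard-coded detections standing in for a vision-language model, and the containment-based nesting rule are all assumptions made purely for illustration. The sketch only shows how labeled bounding boxes might flow into a layout tree and then into nested HTML.

```python
from dataclasses import dataclass, field
from html import escape


@dataclass
class Component:
    """Hypothetical record for a detected UI region: label plus (x0, y0, x1, y1) box."""
    label: str
    box: tuple[int, int, int, int]
    children: list["Component"] = field(default_factory=list)


def area(c: Component) -> int:
    x0, y0, x1, y1 = c.box
    return (x1 - x0) * (y1 - y0)


def contains(outer: Component, inner: Component) -> bool:
    """True if inner's box lies entirely within outer's box."""
    ox0, oy0, ox1, oy1 = outer.box
    ix0, iy0, ix1, iy1 = inner.box
    return ox0 <= ix0 and oy0 <= iy0 and ix1 <= ox1 and iy1 <= oy1


def ground(screenshot_path: str) -> list[Component]:
    """Grounding stage (stub): a real system would query a vision-language model.

    The path is ignored here and fixed detections are returned (assumption).
    """
    return [
        Component("header", (0, 0, 800, 80)),
        Component("logo", (10, 10, 110, 70)),
        Component("sidebar", (0, 80, 200, 600)),
        Component("main", (200, 80, 800, 600)),
        Component("button", (220, 100, 320, 140)),
    ]


def plan(components: list[Component]) -> Component:
    """Planning stage (assumed rule): nest each box under its smallest enclosing box."""
    width = max((c.box[2] for c in components), default=0)
    height = max((c.box[3] for c in components), default=0)
    root = Component("page", (0, 0, width, height))
    placed = [root]
    # Place larger regions first so that candidate parents exist before their children.
    for comp in sorted(components, key=area, reverse=True):
        parent = min((p for p in placed if contains(p, comp)), key=area)
        parent.children.append(comp)
        placed.append(comp)
    return root


def generate(node: Component, depth: int = 0) -> str:
    """Generation stage (stub): emit nested <div> elements in top-to-bottom, left-to-right order."""
    indent = "  " * depth
    cls = escape(node.label)
    kids = sorted(node.children, key=lambda c: (c.box[1], c.box[0]))
    if not kids:
        return f'{indent}<div class="{cls}"></div>'
    inner = "\n".join(generate(k, depth + 1) for k in kids)
    return f'{indent}<div class="{cls}">\n{inner}\n{indent}</div>'


if __name__ == "__main__":
    print(generate(plan(ground("mockup.png"))))
```

Keeping the stages as separate functions mirrors the modularity the abstract argues for: the intermediate detections and the layout tree can each be inspected or swapped out without touching the other stages.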