WebVIA: A Web-based Vision-Language Agentic Framework for Interactive and Verifiable UI-to-Code Generation

Abstract

User interface (UI) development requires translating design mockups into functional code, a process that remains repetitive and labor-intensive. Recent Vision-Language Models (VLMs) can automate UI-to-Code generation, but they produce only static HTML/CSS/JavaScript layouts that lack interactivity. To address this, we propose WebVIA, the first agentic framework for interactive UI-to-Code generation and validation. The framework comprises three components: 1) an exploration agent that captures multi-state UI screenshots; 2) a UI2Code model that generates executable interactive code; and 3) a validation module that verifies the interactivity. Experiments demonstrate that WebVIA-Agent achieves more stable and accurate UI exploration than general-purpose agents (e.g., Gemini-2.5-Pro). In addition, our fine-tuned WebVIA-UI2Code models show substantial improvements in generating executable and interactive HTML/CSS/JavaScript code, outperforming their base counterparts on both interactive and static UI2Code benchmarks. Our code and models are available at https://zheny2751-dotcom.github.io/webvia.github.io/.
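The three-component structure described above can be illustrated with a minimal sketch. This is not the authors' implementation: the toy UI is modeled as a hand-written state graph, state names stand in for screenshots, and every name here (`Transition`, `explore`, `generate_code`, `validate`, `toy_ui`) is hypothetical. The "UI2Code" step is a placeholder that builds a lookup table rather than calling a model.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    source: str  # state before the action (stands in for a screenshot)
    action: str  # e.g. a click on a named element
    target: str  # state after the action


def explore(ui, start):
    """Breadth-first exploration: try every action in every reachable state."""
    seen, queue, transitions = {start}, deque([start]), []
    while queue:
        state = queue.popleft()
        for action, nxt in sorted(ui.get(state, {}).items()):
            transitions.append(Transition(state, action, nxt))
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return transitions


def generate_code(transitions):
    """Placeholder for a UI2Code model: map (state, action) to the next state."""
    return {(t.source, t.action): t.target for t in transitions}


def validate(generated, transitions):
    """Replay each recorded transition and return those the artifact gets wrong."""
    return [t for t in transitions
            if generated.get((t.source, t.action)) != t.target]


# Hypothetical three-state UI used only for illustration.
toy_ui = {
    "home": {"click:menu": "menu_open"},
    "menu_open": {"click:close": "home", "click:settings": "settings"},
    "settings": {"click:back": "home"},
}

trace = explore(toy_ui, "home")
failures = validate(generate_code(trace), trace)
```

In this sketch the validator trivially passes because the placeholder generator copies the trace. In a real pipeline the generated artifact would be rendered code whose behavior is checked against the recorded interactions.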