ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents

Abstract

Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. Recent large language models (LLMs) have made progress in text-to-code generation, but many existing approaches rely solely on natural language prompts, which limits how well they capture spatial layout and visual design intent. In practice, UI development is inherently multimodal and often starts from visual sketches or mockups. To address this gap, we introduce a modular multi-agent framework that performs UI-to-code generation in three interpretable stages: grounding, planning, and generation. The grounding agent uses a vision-language model (VLM) to detect and label UI components, the planning agent constructs a hierarchical layout using front-end engineering priors, and the generation agent produces HTML/CSS code via adaptive prompt-based synthesis. This design improves robustness, interpretability, and fidelity over end-to-end black-box methods. Furthermore, we extend the framework into a scalable data engine that automatically produces large-scale image-code pairs. Using these synthetic examples, we fine-tune and reinforce an open-source VLM, yielding notable gains in UI understanding and code quality. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness. Our code is publicly available at https://github.com/leigest519/ScreenCoder.
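The three stages described in the abstract (grounding, planning, generation) can be pictured with a minimal Python sketch. It is an illustration only, not the authors' implementation: every name in it (Component, LayoutNode, ground, plan, generate, screenshot_to_html) is hypothetical, the grounding stage returns fixed detections instead of querying a vision-language model, the planning heuristic is a simple reading-order sort, and the generation stage emits placeholder div elements instead of prompting a model.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Component:
    """A detected UI region: a label plus a pixel box (x, y, width, height)."""
    label: str
    box: tuple[int, int, int, int]


@dataclass
class LayoutNode:
    """A node in the hierarchical layout tree built by the planning stage."""
    component: Component
    children: list[LayoutNode] = field(default_factory=list)


def ground(screenshot_path: str) -> list[Component]:
    """Grounding stage (stub). A real system would query a vision-language
    model here; fixed detections keep this sketch runnable offline."""
    return [
        Component("main", (240, 80, 960, 720)),
        Component("header", (0, 0, 1200, 80)),
        Component("sidebar", (0, 80, 240, 720)),
    ]


def plan(components: list[Component], page_size: tuple[int, int]) -> LayoutNode:
    """Planning stage (assumed heuristic). Attach every region to a root
    container, ordered top-to-bottom and then left-to-right."""
    root = LayoutNode(Component("page", (0, 0, *page_size)))
    for comp in sorted(components, key=lambda c: (c.box[1], c.box[0])):
        root.children.append(LayoutNode(comp))
    return root


def generate(node: LayoutNode, indent: int = 0) -> str:
    """Generation stage (stub). Emit one absolutely positioned <div> per
    layout node; a real system would synthesize code for each region."""
    x, y, w, h = node.component.box
    pad = "  " * indent
    style = f"position:absolute;left:{x}px;top:{y}px;width:{w}px;height:{h}px"
    inner = "".join(generate(child, indent + 1) for child in node.children)
    return f'{pad}<div class="{node.component.label}" style="{style}">\n{inner}{pad}</div>\n'


def screenshot_to_html(screenshot_path: str,
                       page_size: tuple[int, int] = (1200, 800)) -> str:
    """Run the three stages in sequence on one screenshot."""
    return generate(plan(ground(screenshot_path), page_size))


if __name__ == "__main__":
    print(screenshot_to_html("mockup.png"))
```

Keeping the stages as separate functions with explicit intermediate data (a component list, then a layout tree) is what makes each step inspectable, which is the interpretability argument the abstract makes against end-to-end black-box generation.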
