Design2Code: How Far Are We From Automating Front-End Engineering?

Abstract

Generative AI has made rapid advancements in recent years, achieving unprecedented capabilities in multimodal understanding and code generation. This can enable a new paradigm of front-end development in which multimodal LLMs directly convert visual designs into code implementations. In this work, we formalize this as the Design2Code task and conduct comprehensive benchmarking. Specifically, we manually curate a benchmark of 484 diverse real-world webpages as test cases and develop a set of automatic evaluation metrics to assess how well current multimodal LLMs, given a screenshot as input, can generate code that renders directly into the corresponding reference webpage. We also complement the automatic metrics with comprehensive human evaluations. We develop a suite of multimodal prompting methods and show their effectiveness on GPT-4V and Gemini Pro Vision. We further finetune an open-source Design2Code-18B model that matches the performance of Gemini Pro Vision. Both human evaluation and automatic metrics show that GPT-4V performs best on this task compared to other models. Moreover, annotators think GPT-4V-generated webpages can replace the original reference webpages in 49% of cases in terms of visual appearance and content; and, perhaps surprisingly, in 64% of cases GPT-4V-generated webpages are considered better than the original reference webpages. Our fine-grained breakdown metrics indicate that open-source models mostly lag in recalling visual elements from the input webpages and in generating correct layout designs, while aspects like text content and coloring can be drastically improved with proper finetuning.
