
Design2Code: How Far Are We From Automating Front-End Engineering?

March 5, 2024
作者: Chenglei Si, Yanzhe Zhang, Zhengyuan Yang, Ruibo Liu, Diyi Yang
cs.AI

Abstract

Generative AI has made rapid advancements in recent years, achieving unprecedented capabilities in multimodal understanding and code generation. This can enable a new paradigm of front-end development, in which multimodal LLMs might directly convert visual designs into code implementations. In this work, we formalize this as a Design2Code task and conduct comprehensive benchmarking. Specifically, we manually curate a benchmark of 484 diverse real-world webpages as test cases and develop a set of automatic evaluation metrics to assess how well current multimodal LLMs can generate the code implementations that directly render into the given reference webpages, given the screenshots as input. We also complement automatic metrics with comprehensive human evaluations. We develop a suite of multimodal prompting methods and show their effectiveness on GPT-4V and Gemini Pro Vision. We further finetune an open-source Design2Code-18B model that successfully matches the performance of Gemini Pro Vision. Both human evaluation and automatic metrics show that GPT-4V performs the best on this task compared to other models. Moreover, annotators think GPT-4V generated webpages can replace the original reference webpages in 49% of cases in terms of visual appearance and content; and perhaps surprisingly, in 64% of cases GPT-4V generated webpages are considered better than the original reference webpages. Our fine-grained break-down metrics indicate that open-source models mostly lag in recalling visual elements from the input webpages and in generating correct layout designs, while aspects like text content and coloring can be drastically improved with proper finetuning.
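To make the task concrete, here is a minimal sketch of the kind of direct screenshot-to-code prompting the benchmark evaluates: a webpage screenshot is sent to a multimodal LLM, which is asked to return a single self-contained HTML file. This is an illustrative assumption rather than the paper's exact prompt or prompting suite; the model identifier, file names, and the screenshot_to_html helper are placeholders, and it assumes the OpenAI Python SDK.

```python
# A minimal sketch of screenshot-to-HTML prompting for the Design2Code task.
# Assumptions (not from the paper): the OpenAI Python SDK, the model name
# "gpt-4-vision-preview", the prompt wording, and the file names below.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def screenshot_to_html(screenshot_path: str) -> str:
    """Ask the model to reproduce the screenshot as one HTML file with inline CSS."""
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed model identifier
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            "Here is a screenshot of a webpage. Write a single HTML file "
                            "with embedded CSS that reproduces its layout, text, and colors "
                            "as closely as possible. Use placeholder images where needed. "
                            "Return only the HTML code."
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            }
        ],
        max_tokens=4096,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    html = screenshot_to_html("reference_page.png")
    with open("generated_page.html", "w", encoding="utf-8") as f:
        f.write(html)
```

Scoring many such generations against their reference webpages is what the benchmark's automatic metrics are designed to do.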
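The paper's automatic metrics compare generated pages against the references along fine-grained dimensions such as visual element recall, layout, text content, and coloring. As a hedged illustration only, the sketch below computes a much cruder signal, the character-level similarity between the visible text of the reference and generated pages; the file names and helpers are assumptions, and this is not the paper's actual metric suite.

```python
# A simplified stand-in for automatic evaluation of a generated webpage against
# its reference. The real benchmark uses fine-grained metrics on rendered
# screenshots; this sketch only measures how much of the reference's visible
# text the generated page reproduces. File names below are assumptions.
from difflib import SequenceMatcher
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def visible_text(html_path: str) -> str:
    """Extract the visible text of an HTML file, ignoring scripts and styles."""
    with open(html_path, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())

def text_similarity(reference_html: str, generated_html: str) -> float:
    """Character-level similarity ratio in [0, 1] between the two pages' text."""
    ref, gen = visible_text(reference_html), visible_text(generated_html)
    return SequenceMatcher(None, ref, gen).ratio()

if __name__ == "__main__":
    score = text_similarity("reference_page.html", "generated_page.html")
    print(f"text similarity: {score:.3f}")
```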