

Design2Code: How Far Are We From Automating Front-End Engineering?

March 5, 2024
Authors: Chenglei Si, Yanzhe Zhang, Zhengyuan Yang, Ruibo Liu, Diyi Yang
cs.AI

Abstract

Generative AI has made rapid advancements in recent years, achieving unprecedented capabilities in multimodal understanding and code generation. This can enable a new paradigm of front-end development, in which multimodal LLMs might directly convert visual designs into code implementations. In this work, we formalize this as a Design2Code task and conduct comprehensive benchmarking. Specifically, we manually curate a benchmark of 484 diverse real-world webpages as test cases and develop a set of automatic evaluation metrics to assess how well current multimodal LLMs can generate the code implementations that directly render into the given reference webpages, given the screenshots as input. We also complement automatic metrics with comprehensive human evaluations. We develop a suite of multimodal prompting methods and show their effectiveness on GPT-4V and Gemini Pro Vision. We further finetune an open-source Design2Code-18B model that successfully matches the performance of Gemini Pro Vision. Both human evaluation and automatic metrics show that GPT-4V performs the best on this task compared to other models. Moreover, annotators think GPT-4V generated webpages can replace the original reference webpages in 49% of cases in terms of visual appearance and content; and perhaps surprisingly, in 64% of cases GPT-4V generated webpages are considered better than the original reference webpages. Our fine-grained break-down metrics indicate that open-source models mostly lag in recalling visual elements from the input webpages and in generating correct layout designs, while aspects like text content and coloring can be drastically improved with proper finetuning.
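To make the idea of "automatic evaluation metrics" concrete, here is a minimal sketch of one such metric: scoring how well the text content of a generated webpage matches the reference page. This is a hypothetical illustration, not the paper's actual implementation; the paper's fine-grained metrics also cover visual-element recall, layout, and coloring, which this toy example omits. The `TextExtractor` and `text_similarity` names are my own.

```python
# Toy text-content metric for comparing a generated webpage against a reference.
# Assumption: a simple character-level similarity on the visible text; the
# paper's real metrics are richer (layout, element recall, color, etc.).
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from an HTML document, skipping <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def text_similarity(reference_html: str, generated_html: str) -> float:
    """Return a score in [0, 1]: 1.0 means identical visible text."""
    texts = []
    for html in (reference_html, generated_html):
        parser = TextExtractor()
        parser.feed(html)
        texts.append(" ".join(parser.chunks))
    return SequenceMatcher(None, texts[0], texts[1]).ratio()


ref = "<html><body><h1>Shop</h1><p>Welcome to our store.</p></body></html>"
gen = "<html><body><h1>Shop</h1><p>Welcome to the store.</p></body></html>"
print(text_similarity(ref, gen))  # high but below 1.0: one word differs
```

A full Design2Code-style evaluation would render both pages to screenshots and compare them visually as well; this sketch only captures the "text content" axis the abstract mentions as one of the breakdown dimensions.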