WAFFLE: Multi-Modal Model for Automated Front-End Development
October 24, 2024
Authors: Shanchao Liang, Nan Jiang, Shangshu Qian, Lin Tan
cs.AI
Abstract
Web development involves turning UI designs into functional webpages, which
can be difficult for both beginners and experienced developers due to the
complexity of HTML's hierarchical structures and styles. While Large Language
Models (LLMs) have shown promise in generating source code, two major
challenges persist in UI-to-HTML code generation: (1) effectively representing
HTML's hierarchical structure for LLMs, and (2) bridging the gap between the
visual nature of UI designs and the text-based format of HTML code. To tackle
these challenges, we introduce Waffle, a new fine-tuning strategy that uses a
structure-aware attention mechanism to improve LLMs' understanding of HTML's
structure and a contrastive fine-tuning approach to align LLMs' understanding
of UI images and HTML code. Models fine-tuned with Waffle show up to 9.00 pp
(percentage point) higher HTML match, 0.0982 higher CW-SSIM, 32.99 higher CLIP,
and 27.12 pp higher LLEM on our new benchmark WebSight-Test and an existing
benchmark Design2Code, outperforming current fine-tuning methods.
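
To make the contrastive fine-tuning idea concrete, the following is a minimal sketch (not the authors' implementation) of an InfoNCE-style alignment objective between UI-screenshot embeddings and HTML-code embeddings, in the spirit of the abstract's description of aligning the model's understanding of UI images and HTML code. The encoder outputs, embedding dimension, and temperature value here are illustrative assumptions, not details taken from the paper.

```python
# Sketch of a contrastive alignment loss between UI-image and HTML-code
# embeddings. All shapes and hyperparameters are illustrative assumptions.
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(image_emb: torch.Tensor,
                               html_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss that pulls each UI image toward its own HTML code
    and pushes it away from the other HTML snippets in the batch."""
    # Normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    html_emb = F.normalize(html_emb, dim=-1)

    # Pairwise similarity matrix: rows = images, columns = HTML snippets.
    logits = image_emb @ html_emb.t() / temperature

    # Matching image/HTML pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over both retrieval directions.
    loss_img_to_html = F.cross_entropy(logits, targets)
    loss_html_to_img = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_img_to_html + loss_html_to_img)


if __name__ == "__main__":
    # Random tensors stand in for encoder outputs of a batch of 8 pairs.
    batch, dim = 8, 512
    img = torch.randn(batch, dim)   # stand-in for UI-screenshot embeddings
    code = torch.randn(batch, dim)  # stand-in for HTML-code embeddings
    print(contrastive_alignment_loss(img, code).item())
```

In this sketch, each UI screenshot is trained to be most similar to its own HTML rendering within the batch; the paper's actual contrastive formulation and the structure-aware attention mechanism are described in the full text.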