Unlocking the Conversion of Web Screenshots into HTML Code with the WebSight Dataset

Abstract

Using vision-language models (VLMs) in web development is a promising strategy for increasing efficiency and enabling no-code solutions: given a screenshot or a sketch of a UI, a VLM could generate the code that reproduces it, for instance in a language like HTML. Despite the progress of VLMs on a variety of tasks, the specific challenge of converting a screenshot into the corresponding HTML has received little attention. We posit that this is mainly due to the absence of a suitable, high-quality dataset. This work introduces WebSight, a synthetic dataset consisting of 2 million pairs of HTML code and corresponding screenshots. We fine-tune a foundational VLM on this dataset and show that it becomes proficient at converting webpage screenshots into functional HTML code. To accelerate research in this area, we open-source WebSight.
