Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset

Abstract

Using vision-language models (VLMs) in web development is a promising strategy to increase efficiency and unblock no-code solutions: given a screenshot or a sketch of a UI, a VLM could generate the code to reproduce it, for instance in a language like HTML. Despite the advances of VLMs on various tasks, the specific challenge of converting a screenshot into the corresponding HTML has been little explored. We posit that this is mainly due to the absence of a suitable, high-quality dataset. This work introduces WebSight, a synthetic dataset consisting of 2 million pairs of HTML code and corresponding screenshots. We fine-tune a foundational VLM on our dataset and show proficiency in converting webpage screenshots to functional HTML code. To accelerate research in this area, we open-source WebSight.
