
Enhancing Vision-Language Pre-training with Rich Supervisions

March 5, 2024
Authors: Yuan Gao, Kunyu Shi, Pengkai Zhu, Edouard Belval, Oren Nuriel, Srikar Appalaraju, Shabnam Ghadar, Vijay Mahadevan, Zhuowen Tu, Stefano Soatto
cs.AI

Abstract

We propose Strongly Supervised pre-training with ScreenShots (S4) - a novel pre-training paradigm for Vision-Language Models using data from large-scale web screenshot rendering. Using web screenshots unlocks a treasure trove of visual and textual cues that are not present when using image-text pairs alone. In S4, we leverage the inherent tree-structured hierarchy of HTML elements and their spatial localization to carefully design 10 pre-training tasks with large-scale annotated data. These tasks resemble downstream tasks across different domains, and the annotations are cheap to obtain. We demonstrate that, compared to current screenshot pre-training objectives, our innovative pre-training method significantly enhances the performance of image-to-text models on nine varied and popular downstream tasks - up to 76.1% improvement on Table Detection, and at least 1% on Widget Captioning.
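
The supervision S4 builds on comes from structure a browser already exposes when it renders a page: every HTML element carries a tag, text content, and a bounding box on the resulting screenshot. The sketch below is only an illustration of how such annotations could be harvested cheaply, not the authors' actual data pipeline; it assumes Playwright for rendering, and the CSS selector and JSON record format are illustrative choices.

```python
# Hedged sketch: collect (screenshot, element text, bounding box) supervision
# from a rendered web page. Assumes Playwright; selector and output format
# are illustrative, not the paper's pipeline.
import json
from playwright.sync_api import sync_playwright

def harvest_supervision(url: str, out_prefix: str = "page") -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 1280})
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=f"{out_prefix}.png")  # the image-side input

        records = []
        # Visible elements provide text and spatial localization "for free".
        for el in page.query_selector_all("a, button, h1, h2, h3, p, img, table"):
            box = el.bounding_box()  # {x, y, width, height} or None if hidden
            if box is None:
                continue
            records.append({
                "tag": el.evaluate("e => e.tagName.toLowerCase()"),
                "text": el.inner_text()[:256],
                "bbox": [box["x"], box["y"], box["width"], box["height"]],
            })
        browser.close()

    # Text-side targets for pre-training tasks can be derived from these records.
    with open(f"{out_prefix}.json", "w") as f:
        json.dump(records, f)

if __name__ == "__main__":
    harvest_supervision("https://example.com")
```

From records like these, one could derive task pairs such as reading text from a screenshot region, grounding an element by its bounding box, or detecting tables, mirroring the kinds of downstream-resembling pre-training tasks the abstract describes.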