
Enhancing Vision-Language Pre-training with Rich Supervisions

March 5, 2024
Authors: Yuan Gao, Kunyu Shi, Pengkai Zhu, Edouard Belval, Oren Nuriel, Srikar Appalaraju, Shabnam Ghadar, Vijay Mahadevan, Zhuowen Tu, Stefano Soatto
cs.AI

Abstract

We propose Strongly Supervised pre-training with ScreenShots (S4) - a novel pre-training paradigm for Vision-Language Models using data from large-scale web screenshot rendering. Web screenshots unlock a treasure trove of visual and textual cues that are not available in plain image-text pairs. In S4, we leverage the inherent tree-structured hierarchy of HTML elements and their spatial localization to carefully design 10 pre-training tasks with large-scale annotated data. These tasks resemble downstream tasks across different domains, and the annotations are cheap to obtain. We demonstrate that, compared to current screenshot pre-training objectives, our pre-training method significantly enhances the performance of image-to-text models on nine varied and popular downstream tasks - with improvements of up to 76.1% on Table Detection and at least 1% on Widget Captioning.
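To make the idea of deriving supervision from rendered web pages concrete, below is a minimal sketch, not the authors' actual pipeline, of how one might render a page, capture a screenshot, and pair visible HTML elements with their text and bounding boxes: the kind of cheap, spatially grounded annotation the abstract describes. The use of Playwright, the selector set, and the record format are all assumptions made for illustration.

```python
# Illustrative sketch (assumed tooling, not the paper's pipeline):
# render a web page, save a screenshot, and collect (tag, text, box)
# records for visible elements as candidate supervision targets.
from playwright.sync_api import sync_playwright


def collect_screenshot_annotations(url: str, screenshot_path: str):
    """Return a list of {tag, text, box} records for visible elements."""
    records = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 1280})
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=screenshot_path)

        # A handful of element types whose text + position make natural
        # targets for localization- or captioning-style objectives.
        for el in page.query_selector_all("h1, h2, h3, p, a, button, li"):
            box = el.bounding_box()  # None if the element is not rendered
            text = (el.inner_text() or "").strip()
            if box and text:
                records.append({
                    "tag": el.evaluate("e => e.tagName.toLowerCase()"),
                    "text": text,
                    "box": (box["x"], box["y"], box["width"], box["height"]),
                })
        browser.close()
    return records


if __name__ == "__main__":
    anns = collect_screenshot_annotations("https://example.com", "page.png")
    for a in anns[:5]:
        print(a["tag"], a["box"], a["text"][:60])
```

Records like these could then be turned into screenshot-conditioned text targets (for example, predicting an element's text given its region), in the spirit of the localization and captioning tasks the abstract mentions.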