Enhancing Vision-Language Pre-training with Rich Supervisions

Abstract

We propose Strongly Supervised pre-training with ScreenShots (S4), a novel pre-training paradigm for Vision-Language Models that uses data from large-scale web screenshot rendering. Web screenshots unlock a treasure trove of visual and textual cues that are not present in image-text pairs. In S4, we leverage the inherent tree-structured hierarchy of HTML elements and their spatial localization to carefully design 10 pre-training tasks with large-scale annotated data. These tasks resemble downstream tasks across different domains, and the annotations are cheap to obtain. We demonstrate that, compared to current screenshot pre-training objectives, our innovative pre-training method significantly enhances the performance of image-to-text models on nine varied and popular downstream tasks, with improvements of up to 76.1% on Table Detection and at least 1% on Widget Captioning.
