Enhancing Vision-Language Pre-training with Rich Supervisions

Abstract

We propose Strongly Supervised pre-training with ScreenShots (S4), a novel pre-training paradigm for Vision-Language Models that uses data from large-scale web screenshot rendering. Web screenshots offer a wealth of visual and textual cues that are not present in image-text pairs. In S4, we leverage the inherent tree-structured hierarchy of HTML elements and their spatial localization to carefully design 10 pre-training tasks with large-scale annotated data. These tasks resemble downstream tasks across different domains, and the annotations are cheap to obtain. We demonstrate that, compared to current screenshot pre-training objectives, our pre-training method significantly improves the performance of an image-to-text model on nine varied and popular downstream tasks, with improvements of up to 76.1% on Table Detection and at least 1% on Widget Captioning.

