A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding

Abstract

Webpages have been a rich, scalable resource for vision-language and language-only tasks. Yet only pieces of webpages are kept: image-caption pairs, long text articles, or raw HTML, never all in one place. As a result, webpage tasks have received little attention, and structured image-text data has been left underused. To study multimodal webpage understanding, we introduce the Wikipedia Webpage suite (WikiWeb2M) of 2M pages. We verify its utility on three generative tasks: page description generation, section summarization, and contextual image captioning. We design a novel attention mechanism, Prefix Global, which selects the most relevant image and text content as global tokens that attend to the rest of the webpage for context. By using page structure to separate such tokens, it performs better than full attention with lower computational complexity. Experiments show that the new annotations from WikiWeb2M improve task performance compared to data from prior work. We also include ablations on sequence length, input features, and model size.
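To make the complexity claim concrete, the following is a minimal sketch of an attention mask in the spirit of Prefix Global, not the authors' implementation. It assumes the selected global tokens sit at the start of the sequence and that the remaining tokens use a local sliding window; the abstract specifies neither detail, and the names `prefix_global_mask`, `num_global`, and `radius` are hypothetical.

```python
def prefix_global_mask(seq_len: int, num_global: int, radius: int) -> list[list[bool]]:
    """Build a boolean mask where mask[i][j] is True if token i may attend to token j.

    Assumptions (not stated in the abstract):
      * the first `num_global` tokens are the global tokens;
      * every other token attends to the global prefix and to neighbours
        within `radius` positions of itself.
    """
    if not 0 <= num_global <= seq_len:
        raise ValueError("num_global must be between 0 and seq_len")
    if radius < 0:
        raise ValueError("radius must be non-negative")

    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            # Global tokens see everything, and everything sees global tokens.
            touches_prefix = i < num_global or j < num_global
            # Non-global tokens also see their local neighbourhood.
            within_window = abs(i - j) <= radius
            row.append(touches_prefix or within_window)
        mask.append(row)
    return mask


def count_attended_pairs(mask: list[list[bool]]) -> int:
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    sparse = prefix_global_mask(seq_len=8, num_global=2, radius=1)
    print("allowed pairs:", count_attended_pairs(sparse), "of", 8 * 8)
```

Under these assumptions the number of allowed pairs grows roughly linearly in sequence length for a fixed prefix size and window, whereas full attention grows quadratically.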
