A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding

Abstract

Webpages have been a rich, scalable resource for vision-language and language-only tasks. Yet only pieces of webpages are kept: image-caption pairs, long text articles, or raw HTML, never all in one place. As a result, webpage tasks have received little attention, and structured image-text data has been left underused. To study multimodal webpage understanding, we introduce the Wikipedia Webpage suite (WikiWeb2M), which contains 2M pages. We verify its utility on three generative tasks: page description generation, section summarization, and contextual image captioning. We design a novel attention mechanism, Prefix Global, which selects the most relevant image and text content as global tokens that attend to the rest of the webpage for context. By using page structure to separate such tokens, it performs better than full attention with lower computational complexity. Experiments show that the new annotations from WikiWeb2M improve task performance compared to data from prior work. We also include ablations on sequence length, input features, and model size.
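
To make the idea behind Prefix Global concrete, the following is a minimal sketch of an attention mask, not the authors' implementation. The abstract states only that a prefix of the most relevant tokens acts as global tokens attending to the rest of the page. The sketch additionally assumes that all remaining tokens attend to that prefix and to a local window around themselves. The function name `prefix_global_mask` and the parameters `num_global` and `radius` are hypothetical.

```python
def prefix_global_mask(seq_len, num_global, radius):
    """Build a boolean attention mask; mask[i][j] is True if token i may attend to token j.

    Assumptions (not stated in the abstract): the first `num_global` tokens form
    the global prefix, and every other token uses a local window of `radius`
    positions on each side in addition to seeing the prefix.
    """
    if not 0 <= num_global <= seq_len:
        raise ValueError("num_global must be between 0 and seq_len")
    if radius < 0:
        raise ValueError("radius must be non-negative")

    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            # Global prefix tokens attend to everything, and everything attends to them.
            involves_prefix = i < num_global or j < num_global
            # All other pairs are allowed only inside the local window.
            within_window = abs(i - j) <= radius
            row.append(involves_prefix or within_window)
        mask.append(row)
    return mask


# Example: 8 tokens, of which the first 2 are treated as the global prefix.
mask = prefix_global_mask(seq_len=8, num_global=2, radius=1)
allowed_pairs = sum(sum(row) for row in mask)
```

Under these assumptions, with `num_global` and `radius` held fixed, the number of allowed query-key pairs grows roughly linearly with sequence length rather than quadratically as in full attention, which is one way to read the abstract's claim of lower computational complexity.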

