
A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding

May 5, 2023
作者: Andrea Burns, Krishna Srinivasan, Joshua Ainslie, Geoff Brown, Bryan A. Plummer, Kate Saenko, Jianmo Ni, Mandy Guo
cs.AI

Abstract

Webpages have been a rich, scalable resource for vision-language and language-only tasks. Yet only pieces of webpages are kept: image-caption pairs, long text articles, or raw HTML, never all in one place. As a result, webpage tasks have received little attention and structured image-text data has been left underused. To study multimodal webpage understanding, we introduce the Wikipedia Webpage suite (WikiWeb2M) of 2M pages. We verify its utility on three generative tasks: page description generation, section summarization, and contextual image captioning. We design a novel attention mechanism, Prefix Global, which selects the most relevant image and text content as global tokens that attend to the rest of the webpage for context. By using page structure to separate such tokens, it performs better than full attention while having lower computational complexity. Experiments show that the new annotations from WikiWeb2M improve task performance compared to data from prior work. We also include ablations on sequence length, input features, and model size.
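To make the attention pattern concrete, below is a minimal sketch of how a prefix-global-style attention mask could be built. It assumes the most relevant image and text tokens have already been placed in the first `num_global` positions of the sequence; the function name, parameters, and the local sliding window are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def prefix_global_mask(seq_len: int, num_global: int, local_radius: int) -> np.ndarray:
    """Sketch of a Prefix-Global-style attention mask (hypothetical, not the authors' code).

    The first `num_global` tokens (assumed to hold the selected webpage content)
    attend to the entire sequence and are attended to by every token; the
    remaining tokens additionally attend within a window of `local_radius`
    positions on either side. mask[i, j] == True means query i may attend to key j.
    """
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # Global prefix tokens attend to everything.
    mask[:num_global, :] = True
    # Every token attends to the global prefix.
    mask[:, :num_global] = True

    # Remaining tokens attend within a sliding local window.
    for i in range(num_global, seq_len):
        lo = max(0, i - local_radius)
        hi = min(seq_len, i + local_radius + 1)
        mask[i, lo:hi] = True
    return mask

# Example: 12 tokens, 4 global prefix tokens, local radius of 2.
print(prefix_global_mask(12, 4, 2).astype(int))
```

Under these assumptions, each non-prefix query attends to at most `num_global + 2 * local_radius + 1` positions, so the cost grows roughly linearly in sequence length rather than quadratically as in full attention, which is the efficiency argument the abstract makes.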