A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding
May 5, 2023
Authors: Andrea Burns, Krishna Srinivasan, Joshua Ainslie, Geoff Brown, Bryan A. Plummer, Kate Saenko, Jianmo Ni, Mandy Guo
cs.AI
Abstract
Webpages have been a rich, scalable resource for vision-language and language-only tasks. Yet only pieces of webpages are kept: image-caption pairs, long text articles, or raw HTML, never all in one place. As a result, webpage tasks have received little attention and structured image-text data has been left underused. To study multimodal webpage understanding, we introduce the Wikipedia Webpage suite (WikiWeb2M) of 2M pages. We verify its utility on three generative tasks: page description generation, section summarization, and contextual image captioning. We design a novel attention mechanism, Prefix Global, which selects the most relevant image and text content as global tokens that attend to the rest of the webpage for context. By using page structure to separate such tokens, it performs better than full attention with lower computational complexity. Experiments show that the new annotations from WikiWeb2M improve task performance compared to data from prior work. We also include ablations on sequence length, input features, and model size.
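To make the Prefix Global idea concrete, below is a minimal sketch of a global-local attention mask of the kind the abstract describes: a prefix of tokens (e.g., the target section's text and images) attends to and is attended by every position, while the remaining context tokens attend only within a local window and to the prefix. The function name, parameters, and window scheme are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def prefix_global_mask(seq_len: int, num_global: int, local_radius: int) -> np.ndarray:
    """Boolean attention mask (True = may attend).

    Hypothetical sketch: the first `num_global` tokens act as the global
    prefix; all other tokens attend locally and to that prefix.
    """
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    # Prefix (global) tokens use full attention in both directions.
    mask[:num_global, :] = True
    mask[:, :num_global] = True
    # Remaining context tokens attend within a sliding local window.
    for i in range(num_global, seq_len):
        lo = max(num_global, i - local_radius)
        hi = min(seq_len, i + local_radius + 1)
        mask[i, lo:hi] = True
    return mask

# Example: a 512-token input with a 64-token prefix and a local radius of 16.
mask = prefix_global_mask(512, 64, 16)
print(mask.sum())  # number of allowed attention pairs, far fewer than 512**2
```

Under these assumptions, the number of attended pairs grows roughly with seq_len * (num_global + local_radius) rather than seq_len**2, which is consistent with the abstract's claim of lower computational complexity than full attention.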