A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding
May 5, 2023
Authors: Andrea Burns, Krishna Srinivasan, Joshua Ainslie, Geoff Brown, Bryan A. Plummer, Kate Saenko, Jianmo Ni, Mandy Guo
cs.AI
Abstract
Webpages have been a rich, scalable resource for vision-language and language-only tasks. Yet only pieces of webpages are kept: image-caption pairs, long text articles, or raw HTML, never all in one place. As a result, webpage tasks have received little attention and structured image-text data has been left underused. To study multimodal webpage understanding, we introduce the Wikipedia Webpage suite (WikiWeb2M) of 2M pages. We verify its utility on three generative tasks: page description generation, section summarization, and contextual image captioning. We design a novel attention mechanism, Prefix Global, which selects the most relevant image and text content as global tokens that attend to the rest of the webpage for context. By using page structure to separate such tokens, it performs better than full attention with lower computational complexity. Experiments show that the new annotations from WikiWeb2M improve task performance compared to data from prior work. We also include ablations on sequence length, input features, and model size.
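To make the Prefix Global idea concrete, below is a minimal sketch of a global-local attention mask of the kind the abstract describes: a prefix of tokens (e.g., the target section's text and images) attends to and is attended by every position, while the remaining context tokens attend only within a local window and to the prefix. The function name, parameters, and window scheme are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def prefix_global_mask(seq_len: int, num_global: int, local_radius: int) -> np.ndarray:
    """Boolean attention mask (True = may attend).

    Hypothetical sketch: the first `num_global` tokens act as the global
    prefix; all other tokens attend locally and to that prefix.
    """
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    # Prefix (global) tokens use full attention in both directions.
    mask[:num_global, :] = True
    mask[:, :num_global] = True
    # Remaining context tokens attend within a sliding local window.
    for i in range(num_global, seq_len):
        lo = max(num_global, i - local_radius)
        hi = min(seq_len, i + local_radius + 1)
        mask[i, lo:hi] = True
    return mask

# Example: a 512-token input with a 64-token prefix and a local radius of 16.
mask = prefix_global_mask(512, 64, 16)
print(mask.sum())  # number of allowed attention pairs, far fewer than 512**2
```

Under these assumptions, the number of attended pairs grows roughly with seq_len * (num_global + local_radius) rather than seq_len**2, which is consistent with the abstract's claim of lower computational complexity than full attention.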