
WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset

May 9, 2023
作者: Andrea Burns, Krishna Srinivasan, Joshua Ainslie, Geoff Brown, Bryan A. Plummer, Kate Saenko, Jianmo Ni, Mandy Guo
cs.AI

Abstract

Webpages have been a rich resource for language and vision-language tasks, yet only pieces of webpages are kept: image-caption pairs, long text articles, or raw HTML, never all in one place. As a result, webpage tasks have received little attention and structured image-text data has been underused. To study multimodal webpage understanding, we introduce the Wikipedia Webpage 2M (WikiWeb2M) suite, the first to retain the full set of images, text, and structure data available in a page. WikiWeb2M can be used for tasks like page description generation, section summarization, and contextual image captioning.
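To make the page-level setting concrete, here is a minimal sketch of how a record that retains text, images, and section structure on a single page might be modeled, and how the three tasks map onto its fields. All class and field names are hypothetical illustrations, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical page-level record: sections keep their text AND their
# images, so all modalities and structure stay in one place.

@dataclass
class ImageWithCaption:
    url: str
    caption: Optional[str] = None  # target for contextual image captioning


@dataclass
class Section:
    title: str
    text: str
    images: List[ImageWithCaption] = field(default_factory=list)


@dataclass
class WebPage:
    url: str
    title: str
    description: Optional[str]  # target for page description generation
    sections: List[Section] = field(default_factory=list)

    def section_summary_inputs(self):
        # For section summarization, each section contributes its title,
        # body text, and image URLs; the rest of the page is available
        # as context because the whole page is stored together.
        return [
            (s.title, s.text, [img.url for img in s.images])
            for s in self.sections
        ]


# Example: one page with a single section holding text and an image.
page = WebPage(
    url="https://en.wikipedia.org/wiki/Example",
    title="Example",
    description=None,
    sections=[
        Section(
            title="History",
            text="Example text for the section.",
            images=[ImageWithCaption(url="example.jpg")],
        )
    ],
)
print(page.section_summary_inputs())
```

The design choice the sketch illustrates is the one the abstract emphasizes: because images are attached to the sections they appear in, rather than stored as isolated image-caption pairs, a model can draw on surrounding page context for all three tasks.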