CRAWLDoc: A Dataset for Robust Ranking of Bibliographic Documents

Abstract

Publication databases rely on accurate metadata extraction from diverse web sources, yet variations in web layouts and data formats present challenges for metadata providers. This paper introduces CRAWLDoc, a new method for contextual ranking of linked web documents. Starting from a publication's URL, such as a digital object identifier, CRAWLDoc retrieves the landing page and all linked web resources, including PDFs, ORCID profiles, and supplementary materials. It embeds these resources, along with their anchor texts and URLs, into a unified representation. To evaluate CRAWLDoc, we created a new, manually labeled dataset of 600 publications from six top publishers in computer science. CRAWLDoc demonstrates robust and layout-independent ranking of relevant documents across publishers and data formats, laying the foundation for improved metadata extraction from web documents with various layouts and formats. Our source code and dataset can be accessed at https://github.com/FKarl/CRAWLDoc.
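
The pipeline described above (collect the resources linked from a landing page, represent each one by its anchor text, URL, and content, then rank them) can be illustrated with a minimal sketch. This is not the CRAWLDoc implementation: the abstract does not specify the embedding model, so a plain bag-of-words cosine similarity is swapped in as a stand-in, and the names `LinkCollector`, `rank_linked_documents`, and `fetched_text` are hypothetical. Fetching is left to the caller, who is assumed to pass the already-downloaded text of each linked resource.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (absolute URL, anchor text) pairs from a landing page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (url, anchor_text) pairs
        self._href = None    # href of the <a> element currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            anchor = " ".join("".join(self._text).split())
            self.links.append((self._href, anchor))
            self._href = None


def tokenize(text):
    """Lowercase alphanumeric tokens; a toy stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_linked_documents(landing_url, landing_html, fetched_text):
    """Rank linked resources by similarity to the landing page.

    fetched_text maps each linked URL to its extracted text (assumed to be
    downloaded by the caller). Each candidate is represented by anchor text,
    URL, and content together, mirroring the unified representation idea.
    """
    parser = LinkCollector(landing_url)
    parser.feed(landing_html)
    # Crude tag stripping to obtain the landing page text used as the query.
    query = Counter(tokenize(re.sub(r"<[^>]+>", " ", landing_html)))
    scored = []
    for url, anchor in parser.links:
        unified = " ".join([anchor, url, fetched_text.get(url, "")])
        scored.append((cosine(query, Counter(tokenize(unified))), url))
    return sorted(scored, reverse=True)
```

Calling `rank_linked_documents(url, html, texts)` returns `(score, url)` pairs with the highest score first. In a real system the count vectors would be replaced by learned embeddings, and boilerplate links such as privacy or login pages would be expected to fall toward the bottom of the ranking.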