CRAWLDoc: A Dataset for Robust Ranking of Bibliographic Documents

Abstract

Publication databases rely on accurate metadata extraction from diverse web sources, yet variations in web layouts and data formats present challenges for metadata providers. This paper introduces CRAWLDoc, a new method for contextual ranking of linked web documents. Starting from a publication's URL, such as a digital object identifier (DOI), CRAWLDoc retrieves the landing page and all linked web resources, including PDFs, ORCID profiles, and supplementary materials. It embeds these resources, together with their anchor texts and URLs, into a unified representation. To evaluate CRAWLDoc, we created a new, manually labeled dataset of 600 publications from six top publishers in computer science. CRAWLDoc ranks relevant documents robustly and independently of layout across publishers and data formats. It lays the foundation for improved metadata extraction from web documents with various layouts and formats. Our source code and dataset can be accessed at https://github.com/FKarl/CRAWLDoc.
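To make the pipeline described above more concrete, the following is a minimal sketch of its three steps: collecting the links on a landing page, combining each linked resource with its anchor text and URL into one text, and ranking the resources against the landing page. It is an illustration under stated assumptions, not the CRAWLDoc implementation. All function and class names are hypothetical, fetching is left out (the HTML and resource contents are passed in as strings), and a simple bag-of-words cosine similarity stands in for the learned embedding, which the abstract does not specify.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (absolute URL, anchor text) pairs from landing-page HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # list of (url, anchor_text)
        self._href = None    # URL of the <a> element currently open, if any
        self._text = []      # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the landing page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            anchor = " ".join("".join(self._text).split())
            self.links.append((self._href, anchor))
            self._href = None


def collect_links(html, base_url):
    """Hypothetical helper: return all (url, anchor_text) pairs in the page."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def unified_text(url, anchor, content):
    """Assumption: the 'unified representation' is approximated by
    concatenating anchor text, URL, and resource content."""
    return f"{anchor} {url} {content}"


def bag_of_words(text):
    """Stand-in for an embedding: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_linked(landing_text, linked):
    """Rank linked resources by similarity to the landing page text.

    linked: list of (url, anchor_text, content) tuples.
    Returns a list of (score, url), highest score first.
    """
    query = bag_of_words(landing_text)
    scored = [
        (cosine(query, bag_of_words(unified_text(url, anchor, content))), url)
        for url, anchor, content in linked
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

In a real system the contents of each link would be downloaded and converted to text (for example, from PDF), and the similarity would come from a trained model rather than token overlap. The sketch only shows how anchor text and URL can travel together with the content so that the ranking does not depend on any publisher-specific page layout.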