CRAWLDoc: A Dataset for Robust Ranking of Bibliographic Documents

Abstract

Publication databases rely on accurate metadata extraction from diverse web sources, yet variations in web layouts and data formats pose challenges for metadata providers. This paper introduces CRAWLDoc, a new method for contextual ranking of linked web documents. Starting from a publication's URL, such as a digital object identifier (DOI), CRAWLDoc retrieves the landing page and all linked web resources, including PDFs, ORCID profiles, and supplementary materials. It embeds these resources, together with their anchor texts and URLs, into a unified representation. To evaluate CRAWLDoc, we created a new, manually labeled dataset of 600 publications from six top publishers in computer science. CRAWLDoc demonstrates a robust, layout-independent ranking of relevant documents across publishers and data formats. It lays the foundation for improved metadata extraction from web documents with various layouts and formats. Our source code and dataset can be accessed at https://github.com/FKarl/CRAWLDoc.
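
The retrieval-and-ranking idea described above can be illustrated with a small, standard-library-only sketch. It is not the CRAWLDoc implementation: the names `LinkCollector`, `tokens`, and `rank_links` are hypothetical, the landing page is passed in as a string instead of being fetched, and a simple token-overlap (Jaccard) score stands in for the learned embedding that the abstract describes. It only shows how each linked resource can be represented by its anchor text plus URL and then ordered by relevance to a query.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import re


class LinkCollector(HTMLParser):
    """Collect (absolute URL, anchor text) pairs from landing-page HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (url, anchor_text) pairs
        self._href = None    # href of the <a> element currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the landing-page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            anchor = " ".join("".join(self._text).split())
            self.links.append((self._href, anchor))
            self._href = None


def tokens(text):
    """Lowercased alphanumeric tokens, used by the toy scorer below."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_links(landing_html, base_url, query):
    """Rank linked resources by Jaccard overlap between the query and
    each link's anchor text plus URL (an assumed stand-in for embeddings)."""
    collector = LinkCollector(base_url)
    collector.feed(landing_html)
    query_tokens = tokens(query)
    scored = []
    for url, anchor in collector.links:
        doc_tokens = tokens(anchor + " " + url)
        union = query_tokens | doc_tokens
        score = len(query_tokens & doc_tokens) / len(union) if union else 0.0
        scored.append((score, url, anchor))
    # Highest score first; sorted() is stable, so ties keep page order.
    return sorted(scored, key=lambda item: item[0], reverse=True)
```

Called with the HTML of a landing page, its URL, and a query such as a publication title, `rank_links` returns `(score, url, anchor_text)` tuples with the best-matching resources first. In a real pipeline the scorer would be replaced by similarity between embedded representations.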
