CRAWLDoc: A Dataset for Robust Ranking of Bibliographic Documents

Abstract

Publication databases rely on accurate metadata extraction from diverse web sources, yet variations in web layouts and data formats present challenges for metadata providers. This paper introduces CRAWLDoc, a new method for contextual ranking of linked web documents. Starting from a publication's URL, such as its digital object identifier (DOI), CRAWLDoc retrieves the landing page and all linked web resources, including PDFs, ORCID profiles, and supplementary materials. It embeds these resources, together with their anchor texts and URLs, into a unified representation. To evaluate CRAWLDoc, we created a new, manually labeled dataset of 600 publications from six top publishers in computer science. CRAWLDoc ranks relevant documents robustly and independently of layout across publishers and data formats. It lays the foundation for improved metadata extraction from web documents with various layouts and formats. Our source code and dataset are available at https://github.com/FKarl/CRAWLDoc.
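
The pipeline described in the abstract (collect the resources linked from a landing page, combine each resource's content with its anchor text and URL, then rank) can be illustrated with a minimal sketch. This is not the CRAWLDoc implementation: a bag-of-words cosine similarity stands in for the learned embedding, the names `LinkCollector`, `rank_linked`, `tokens`, and `cosine` are hypothetical, and the `example.org` URLs are placeholders. Fetching the linked resources is omitted, so their content is assumed to be already available as text.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects (absolute URL, anchor text) pairs from landing-page HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (url, anchor_text) pairs
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the landing-page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            anchor = " ".join("".join(self._text).split())
            self.links.append((self._href, anchor))
            self._href = None


def tokens(text):
    """Lowercased bag of alphanumeric tokens (a stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_linked(landing_text, resources):
    """Rank resources by similarity to the landing page, best first.

    Each resource is a dict with the keys 'url', 'anchor', and 'content'.
    Anchor text, URL, and content are joined into one unified string
    before scoring. Returns a list of (score, url) tuples.
    """
    query = tokens(landing_text)
    scored = []
    for res in resources:
        unified = f"{res['anchor']} {res['url']} {res['content']}"
        scored.append((cosine(query, tokens(unified)), res["url"]))
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    html = '<a href="/paper.pdf">Download <b>PDF</b></a> <a href="/privacy">Privacy</a>'
    collector = LinkCollector("https://example.org/doi/10.1000/xyz")
    collector.feed(html)
    for url, anchor in collector.links:
        print(url, "|", anchor)
```

Joining anchor text, URL, and content into a single string before scoring mirrors the idea of a unified representation; a real system would replace `tokens` and `cosine` with a trained text encoder and a learned relevance score.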