Harvesting Textual and Structured Data from the HAL Publication Repository

July 30, 2024
Authors: Francis Kulumba, Wissam Antoun, Guillaume Vimont, Laurent Romary
cs.AI

Abstract

HAL (Hyper Articles en Ligne) is the French national publication repository, used by most higher education and research organizations for their open science policy. As a digital library, it is a rich repository of scholarly documents, but its potential for advanced research has been underutilized. We present HALvest, a unique dataset that bridges the gap between citation networks and the full text of papers submitted on HAL. We craft our dataset by filtering HAL for scholarly publications, resulting in approximately 700,000 documents, spanning 34 languages across 13 identified domains, suitable for language model training, and yielding approximately 16.5 billion tokens (with 8 billion in French and 7 billion in English, the most represented languages). We transform the metadata of each paper into a citation network, producing a directed heterogeneous graph. This graph includes uniquely identified authors on HAL, as well as all open submitted papers, and their citations. We provide a baseline for authorship attribution using the dataset, implement a range of state-of-the-art models in graph representation learning for link prediction, and discuss the usefulness of our generated knowledge graph structure.
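
The abstract describes turning per-paper metadata into a directed heterogeneous graph of authors, papers, and citations. The following is a minimal sketch of that idea in plain Python, not the authors' pipeline. The record fields (`paper_id`, `author_ids`, `cited_ids`), the edge-type names, and the helper `build_hetero_graph` are all hypothetical and chosen only for illustration; they do not reflect HAL's actual metadata schema.

```python
from collections import defaultdict

# Hypothetical metadata records; field names are illustrative, not HAL's schema.
records = [
    {"paper_id": "p1", "author_ids": ["a1", "a2"], "cited_ids": ["p2"]},
    {"paper_id": "p2", "author_ids": ["a2"], "cited_ids": []},
    {"paper_id": "p3", "author_ids": ["a3"], "cited_ids": ["p1", "p2"]},
]


def build_hetero_graph(records):
    """Build typed node sets and typed, directed edge sets from metadata records."""
    nodes = {"author": set(), "paper": set()}
    edges = defaultdict(set)  # (source type, relation, target type) -> {(source, target)}
    for rec in records:
        paper = rec["paper_id"]
        nodes["paper"].add(paper)
        # Authorship edges point from an author node to a paper node.
        for author in rec["author_ids"]:
            nodes["author"].add(author)
            edges[("author", "writes", "paper")].add((author, paper))
        # Citation edges point from the citing paper to the cited paper.
        for cited in rec["cited_ids"]:
            nodes["paper"].add(cited)
            edges[("paper", "cites", "paper")].add((paper, cited))
    return nodes, dict(edges)


graph_nodes, graph_edges = build_hetero_graph(records)
```

Keying edges by a (source type, relation, target type) triple keeps the two relations separate, which is the usual shape expected by heterogeneous graph learning libraries. Under this representation, link prediction amounts to scoring candidate pairs that are absent from one of these edge sets.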
