Harvesting Textual and Structured Data from the HAL Publication Repository

July 30, 2024
Authors: Francis Kulumba, Wissam Antoun, Guillaume Vimont, Laurent Romary
cs.AI

Abstract

HAL (Hyper Articles en Ligne) is the French national publication repository, used by most higher education and research organizations for their open science policy. As a digital library, it is a rich repository of scholarly documents, but its potential for advanced research has been underutilized. We present HALvest, a unique dataset that bridges the gap between citation networks and the full text of papers submitted on HAL. We craft our dataset by filtering HAL for scholarly publications, resulting in approximately 700,000 documents, spanning 34 languages across 13 identified domains, suitable for language model training, and yielding approximately 16.5 billion tokens (with 8 billion in French and 7 billion in English, the most represented languages). We transform the metadata of each paper into a citation network, producing a directed heterogeneous graph. This graph includes uniquely identified authors on HAL, as well as all open submitted papers, and their citations. We provide a baseline for authorship attribution using the dataset, implement a range of state-of-the-art models in graph representation learning for link prediction, and discuss the usefulness of our generated knowledge graph structure.
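
To make the phrase "directed heterogeneous graph" concrete, here is a minimal sketch of how paper metadata could be turned into typed nodes and typed directed edges. It is an illustration only, not the authors' pipeline: the record fields (`id`, `authors`, `cites`), the relation names (`writes`, `cites`), and the `build_graph` helper are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical metadata records; the field names are assumptions for illustration.
papers = [
    {"id": "paper-1", "authors": ["author-a", "author-b"], "cites": ["paper-2"]},
    {"id": "paper-2", "authors": ["author-b"], "cites": []},
]


def build_graph(records):
    """Return typed nodes and typed directed edges for a small citation network."""
    nodes = {"author": set(), "paper": set()}
    # Each edge type is keyed by (source type, relation, target type).
    edges = defaultdict(set)
    for record in records:
        nodes["paper"].add(record["id"])
        for author in record["authors"]:
            nodes["author"].add(author)
            edges[("author", "writes", "paper")].add((author, record["id"]))
        for cited in record["cites"]:
            # A cited paper becomes a node even if it has no record of its own.
            nodes["paper"].add(cited)
            edges[("paper", "cites", "paper")].add((record["id"], cited))
    return nodes, dict(edges)


nodes, edges = build_graph(papers)
```

Keeping a separate edge set per (source type, relation, target type) triple is what makes the graph heterogeneous, and storing each edge as an ordered (source, target) pair is what makes it directed. In a setup like this, link prediction for authorship attribution would amount to scoring candidate pairs for the author-to-paper relation.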
