MOLE: Metadata Extraction and Validation in Scientific Papers Using Large Language Models

Abstract

Metadata extraction is essential for cataloging and preserving datasets, enabling effective research discovery and reproducibility, especially given the current exponential growth in scientific research. While Masader (Alyafeai et al., 2021) laid the groundwork for extracting a wide range of metadata attributes from scholarly articles on Arabic NLP datasets, it relies heavily on manual annotation. In this paper, we present MOLE, a framework that leverages Large Language Models (LLMs) to automatically extract metadata attributes from scientific papers covering datasets of languages other than Arabic. Our schema-driven methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output. Additionally, we introduce a new benchmark to evaluate research progress on this task. Through systematic analysis of context length, few-shot learning, and web browsing integration, we demonstrate that modern LLMs show promising results in automating this task, while highlighting the need for further improvements to ensure consistent and reliable performance. We release the code (https://github.com/IVUL-KAUST/MOLE) and the dataset (https://huggingface.co/datasets/IVUL-KAUST/MOLE) to the research community.
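To make the idea of schema-driven extraction with output validation more concrete, the following is a minimal sketch of how a model's JSON output might be checked against a schema. It is not MOLE's implementation: the attribute names (`Name`, `Year`, `License`, `Tasks`), the allowed options, the year bounds, and the `validate` helper are all hypothetical and chosen only for illustration.

```python
import json

# Hypothetical schema: attribute names, types, bounds, and allowed options
# are illustrative assumptions, not MOLE's actual schema.
SCHEMA = {
    "Name": {"type": str},
    "Year": {"type": int, "min": 1900, "max": 2100},
    "License": {"type": str, "options": ["MIT", "Apache-2.0", "CC BY 4.0", "unknown"]},
    "Tasks": {"type": list, "options": ["machine translation", "sentiment analysis", "other"]},
}


def validate(raw_output, schema=SCHEMA):
    """Parse a model's JSON output and return (metadata, errors)."""
    try:
        metadata = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(metadata, dict):
        return None, ["top-level value must be a JSON object"]

    errors = []
    for key, spec in schema.items():
        if key not in metadata:
            errors.append(f"{key}: missing")
            continue
        value = metadata[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, spec["type"]):
            errors.append(f"{key}: expected {spec['type'].__name__}")
            continue
        if "options" in spec:
            # Treat a scalar as a one-element list so both cases share a check.
            items = value if isinstance(value, list) else [value]
            bad = [v for v in items if v not in spec["options"]]
            if bad:
                errors.append(f"{key}: values not in allowed options: {bad}")
        if "min" in spec and value < spec["min"]:
            errors.append(f"{key}: below minimum {spec['min']}")
        if "max" in spec and value > spec["max"]:
            errors.append(f"{key}: above maximum {spec['max']}")

    # Flag attributes the schema does not define.
    for key in metadata:
        if key not in schema:
            errors.append(f"{key}: unexpected attribute")
    return metadata, errors


if __name__ == "__main__":
    sample = '{"Name": "ExampleCorpus", "Year": "2021", "License": "MIT", "Tasks": []}'
    _, problems = validate(sample)
    for problem in problems:
        print(problem)
```

One plausible use of the returned error list, again as an assumption rather than a description of MOLE, is to feed it back to the model in a follow-up prompt so that it can correct the offending attributes.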