MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs
May 26, 2025
Authors: Zaid Alyafeai, Maged S. Al-Shaibani, Bernard Ghanem
cs.AI
Abstract
Metadata extraction is essential for cataloging and preserving datasets,
enabling effective research discovery and reproducibility, especially given the
current exponential growth in scientific research. While Masader (Alyafeai et
al., 2021) laid the groundwork for extracting a wide range of metadata
attributes from scholarly articles on Arabic NLP datasets, it relies heavily on
manual annotation. In this paper, we present MOLE, a framework that leverages
Large Language Models (LLMs) to automatically extract metadata attributes from
scientific papers covering datasets of languages other than Arabic. Our
schema-driven methodology processes entire documents across multiple input
formats and incorporates robust validation mechanisms for consistent output.
Additionally, we introduce a new benchmark to evaluate research progress on
this task. Through systematic analysis of context length, few-shot learning,
and web browsing integration, we demonstrate that modern LLMs show promising
results in automating this task, while highlighting the need for further
improvements to ensure consistent and reliable performance. We release the
code (https://github.com/IVUL-KAUST/MOLE) and the dataset
(https://huggingface.co/datasets/IVUL-KAUST/MOLE) to the research community.
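
To make the idea of schema-driven validation concrete, the following is a
minimal sketch of how an extracted metadata record might be checked against a
schema. It is not the MOLE implementation: the attribute names, the allowed
license values, and the `validate` helper are all assumptions chosen for
illustration.

```python
from typing import Any

# Hypothetical schema (an assumption, not the MOLE schema):
# attribute name -> (expected Python type, allowed options or None).
SCHEMA = {
    "Name": (str, None),
    "Year": (int, None),
    "License": (str, ["Apache-2.0", "CC BY 4.0", "MIT", "unknown"]),
    "Languages": (list, None),
}


def validate(record: dict[str, Any], schema: dict = SCHEMA) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for key, (expected_type, options) in schema.items():
        if key not in record:
            errors.append(f"missing attribute: {key}")
            continue
        value = record[key]
        if not isinstance(value, expected_type):
            errors.append(f"{key}: expected {expected_type.__name__}")
        elif options is not None and value not in options:
            errors.append(f"{key}: value not in allowed options")
    # Flag attributes the model produced that the schema does not define.
    for key in record:
        if key not in schema:
            errors.append(f"unexpected attribute: {key}")
    return errors


if __name__ == "__main__":
    # A made-up model output with a wrong type and a missing attribute.
    extracted = {"Name": "ExampleSet", "Year": "2021", "License": "MIT"}
    for problem in validate(extracted):
        print(problem)
```

A check of this kind lets malformed model outputs be detected and retried or
repaired before they reach a catalog, which is one plausible way to obtain
consistent output.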