MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs
May 26, 2025
Authors: Zaid Alyafeai, Maged S. Al-Shaibani, Bernard Ghanem
cs.AI
Abstract
Metadata extraction is essential for cataloging and preserving datasets,
enabling effective research discovery and reproducibility, especially given the
current exponential growth in scientific research. While Masader (Alyafeai et
al., 2021) laid the groundwork for extracting a wide range of metadata
attributes from scholarly articles on Arabic NLP datasets, it relies heavily on
manual annotation. In this paper, we present MOLE, a framework that leverages
Large Language Models (LLMs) to automatically extract metadata attributes from
scientific papers covering datasets of languages other than Arabic. Our
schema-driven methodology processes entire documents across multiple input
formats and incorporates robust validation mechanisms for consistent output.
Additionally, we introduce a new benchmark to evaluate the research progress on
this task. Through systematic analysis of context length, few-shot learning,
and web browsing integration, we demonstrate that modern LLMs show promising
results in automating this task, while highlighting the need for further
improvements to ensure consistent and reliable performance. We release the
code (https://github.com/IVUL-KAUST/MOLE) and dataset
(https://huggingface.co/datasets/IVUL-KAUST/MOLE) for the research community.