MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs

Abstract

Metadata extraction is essential for cataloging and preserving datasets, enabling effective research discovery and reproducibility, especially given the current exponential growth in scientific research. While Masader (Alyafeai et al., 2021) laid the groundwork for extracting a wide range of metadata attributes from scholarly articles on Arabic NLP datasets, it relies heavily on manual annotation. In this paper, we present MOLE, a framework that leverages Large Language Models (LLMs) to automatically extract metadata attributes from scientific papers covering datasets of languages other than Arabic. Our schema-driven methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output. Additionally, we introduce a new benchmark to evaluate research progress on this task. Through systematic analysis of context length, few-shot learning, and web browsing integration, we demonstrate that modern LLMs show promising results in automating this task, while highlighting the need for further improvements to ensure consistent and reliable performance. We release the code (https://github.com/IVUL-KAUST/MOLE) and dataset (https://huggingface.co/datasets/IVUL-KAUST/MOLE) to the research community.
