MOLE: Metadata Extraction and Validation in Scientific Papers Using Large Language Models

Abstract

Metadata extraction is essential for cataloging and preserving datasets, enabling effective research discovery and reproducibility, especially given the current exponential growth in scientific research. While Masader (Alyafeai et al., 2021) laid the groundwork for extracting a wide range of metadata attributes from scholarly articles on Arabic NLP datasets, it relies heavily on manual annotation. In this paper, we present MOLE, a framework that leverages Large Language Models (LLMs) to automatically extract metadata attributes from scientific papers covering datasets of languages other than Arabic. Our schema-driven methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output. Additionally, we introduce a new benchmark to evaluate research progress on this task. Through a systematic analysis of context length, few-shot learning, and web browsing integration, we demonstrate that modern LLMs show promising results in automating this task, while highlighting the need for further improvements to ensure consistent and reliable performance. We release the code (https://github.com/IVUL-KAUST/MOLE) and the dataset (https://huggingface.co/datasets/IVUL-KAUST/MOLE) for the research community.
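
To make the idea of schema-driven extraction with validation more concrete, the following minimal Python sketch checks a model's JSON output against a small schema. It is an illustration under stated assumptions, not the MOLE implementation: the attribute names, allowed options, value ranges, and the `validate` helper are all hypothetical and are not taken from the released code or dataset.

```python
import json
from typing import Any

# Hypothetical schema: attribute names, types, ranges, and options are
# illustrative assumptions, not the schema used by MOLE.
SCHEMA: dict[str, dict[str, Any]] = {
    "Name": {"type": str},
    "Year": {"type": int, "min": 1900, "max": 2100},
    "License": {"type": str, "options": ["CC BY 4.0", "MIT", "Apache-2.0", "unknown"]},
    "Tasks": {"type": list, "options": ["machine translation", "sentiment analysis", "other"]},
}


def validate(record: dict[str, Any], schema: dict[str, dict[str, Any]] = SCHEMA) -> list[str]:
    """Return a list of problems found in `record`; an empty list means it passed."""
    errors: list[str] = []
    for key, rule in schema.items():
        if key not in record:
            errors.append(f"missing attribute: {key}")
            continue
        value = record[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, rule["type"]):
            errors.append(
                f"{key}: expected {rule['type'].__name__}, got {type(value).__name__}"
            )
            continue
        if "min" in rule and value < rule["min"]:
            errors.append(f"{key}: {value} is below the minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            errors.append(f"{key}: {value} is above the maximum {rule['max']}")
        if "options" in rule:
            # Treat a scalar as a one-element list so both cases share one check.
            values = value if isinstance(value, list) else [value]
            for item in values:
                if item not in rule["options"]:
                    errors.append(f"{key}: {item!r} is not an allowed option")
    # Flag attributes the model produced that the schema does not define.
    for key in record:
        if key not in schema:
            errors.append(f"unexpected attribute: {key}")
    return errors


# Example: a made-up model response, parsed as JSON and then validated.
llm_output = '{"Name": "ExampleCorpus", "Year": "2021", "License": "GPL", "Extra": 1}'
for problem in validate(json.loads(llm_output)):
    print(problem)
```

In a pipeline of this kind, a non-empty error list could be used to reject the output or to prompt the model again; which strategy works best is a design choice this sketch does not settle.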