ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature
October 23, 2025
Authors: Aritra Roy, Enrico Grisan, John Buckeridge, Chiara Gattinoni
cs.AI
Abstract
Since the advent of various pre-trained large language models, extracting
structured knowledge from scientific text has experienced a revolutionary
change compared with traditional machine learning or natural language
processing techniques. Despite these advances, accessible automated tools that
allow users to construct, validate, and visualise datasets from scientific
literature extraction remain scarce. We therefore developed ComProScanner, an
autonomous multi-agent platform that facilitates the extraction, validation,
classification, and visualisation of machine-readable chemical compositions and
properties, integrated with synthesis data from journal articles for
comprehensive database creation. We evaluated our framework on 100 journal
articles using 10 different LLMs, including both open-source and proprietary
models, to extract highly complex compositions associated with ceramic
piezoelectric materials and corresponding piezoelectric strain coefficients
(d33), motivated by the lack of a large dataset for such materials.
DeepSeek-V3-0324 outperformed all other models, with an overall accuracy of
0.82. This framework provides a simple, user-friendly, readily usable package
for extracting highly complex experimental data buried in the literature to
build machine learning or deep learning datasets.