ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature

October 23, 2025
Authors: Aritra Roy, Enrico Grisan, John Buckeridge, Chiara Gattinoni
cs.AI

Abstract

Since the advent of pre-trained large language models (LLMs), methods for extracting structured knowledge from scientific text have changed dramatically compared with traditional machine learning or natural language processing techniques. Despite these advances, accessible automated tools that let users construct, validate, and visualise datasets extracted from scientific literature remain scarce. We therefore developed ComProScanner, an autonomous multi-agent platform that facilitates the extraction, validation, classification, and visualisation of machine-readable chemical compositions and properties, integrated with synthesis data from journal articles, for comprehensive database creation. Motivated by the lack of a large dataset for ceramic piezoelectric materials and their piezoelectric strain coefficients (d33), we evaluated the framework on 100 journal articles, comparing 10 different LLMs, both open-source and proprietary, on their ability to extract the highly complex compositions involved. DeepSeek-V3-0324 outperformed all other models, with an overall accuracy of 0.82. The framework provides a simple, user-friendly, readily usable package for extracting highly complex experimental data buried in the literature to build machine learning or deep learning datasets.