ComProScanner: A Multi-Agent Framework for Extracting Structured Composition-Property Data from Scientific Literature

Abstract

Since the advent of pre-trained large language models, the extraction of structured knowledge from scientific text has changed dramatically compared with traditional machine learning and natural language processing techniques. Despite these advances, accessible automated tools that let users construct, validate, and visualise datasets extracted from the scientific literature remain scarce. We therefore developed ComProScanner, an autonomous multi-agent platform that facilitates the extraction, validation, classification, and visualisation of machine-readable chemical compositions and properties, integrated with synthesis data from journal articles, for comprehensive database creation. We evaluated the framework on 100 journal articles with 10 different LLMs, including both open-source and proprietary models, extracting highly complex compositions of ceramic piezoelectric materials together with their piezoelectric strain coefficients (d33); this task was motivated by the lack of a large dataset for such materials. DeepSeek-V3-0324 outperformed all other models, with an overall accuracy of 0.82. The framework provides a simple, user-friendly, ready-to-use package for extracting highly complex experimental data buried in the literature to build machine learning or deep learning datasets.