MatTools: Benchmarking Large Language Models for Materials Science Tools

Abstract

Large language models (LLMs) are increasingly applied to materials science questions, including literature comprehension, property prediction, materials discovery and alloy design. At the same time, a wide range of physics-based computational approaches have been developed with which materials properties can be calculated. Here, we propose MatTools, a benchmark application that evaluates the proficiency of LLMs in answering materials science questions through the generation and safe execution of code based on such physics-based computational materials science packages. MatTools is built on two complementary components: a materials simulation tool question-answer (QA) benchmark and a real-world tool-usage benchmark. We designed an automated methodology to efficiently collect real-world materials science tool-use examples. The QA benchmark, derived from the pymatgen (Python Materials Genomics) codebase and documentation, comprises 69,225 QA pairs that assess the ability of an LLM to understand materials science tools. The real-world benchmark contains 49 tasks (138 subtasks) requiring the generation of functional Python code for materials property calculations. Our evaluation of diverse LLMs yields three key insights: (1) generalists outshine specialists; (2) AI knows AI; and (3) simpler is better. MatTools provides a standardized framework for assessing and improving LLM capabilities for materials science tool applications, facilitating the development of more effective AI systems for materials science and general scientific research.
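
As an illustration of what "generation and safe execution" of code might involve in the real-world benchmark, the following is a minimal sketch, not the MatTools implementation. It assumes that each task supplies a string of LLM-generated Python code and a list of expected output lines, one per subtask. The function names `run_generated_code` and `subtask_pass_rate`, the result dictionary, and the toy candidate program are all hypothetical. A child process with a timeout is only a weak form of isolation; a real harness would likely use a container or a similar sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_generated_code(code: str, timeout_s: float = 10.0) -> dict:
    """Run a code string in a separate Python process and capture its output.

    Hypothetical helper: the process boundary and the timeout keep crashes
    and infinite loops away from the evaluator, but this is NOT a sandbox.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,            # keep any files it writes in the temp dir
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "stdout": "", "error": "timeout"}
    return {
        "ok": proc.returncode == 0,
        "stdout": proc.stdout.strip(),
        "error": proc.stderr.strip(),
    }


def subtask_pass_rate(code: str, expected_lines: list[str]) -> float:
    """Fraction of subtasks whose printed line matches the expected line.

    Assumed scoring rule for illustration: one output line per subtask,
    compared as exact strings; any execution failure scores zero.
    """
    result = run_generated_code(code)
    if not result["ok"] or not expected_lines:
        return 0.0
    got = [line.strip() for line in result["stdout"].splitlines()]
    hits = sum(1 for exp, out in zip(expected_lines, got) if exp == out)
    return hits / len(expected_lines)


if __name__ == "__main__":
    # Toy stand-in for LLM-generated code; a real task would call pymatgen.
    candidate = (
        'masses = {"Na": 22.99, "Cl": 35.45}\n'
        "print(round(sum(masses.values()), 2))\n"
        "print(len(masses))\n"
    )
    print(subtask_pass_rate(candidate, ["58.44", "2"]))
```

Scoring by subtask rather than by task mirrors the 49-task, 138-subtask structure described above, so a program that solves only part of a task still receives partial credit under this assumed rule.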
