MatTools: Benchmarking Large Language Models for Materials Science Tools
May 16, 2025
Authors: Siyu Liu, Jiamin Xu, Beilin Ye, Bo Hu, David J. Srolovitz, Tongqi Wen
cs.AI
Abstract
Large language models (LLMs) are increasingly applied to materials science
questions, including literature comprehension, property prediction, materials
discovery and alloy design. At the same time, a wide range of physics-based
computational approaches have been developed for calculating materials
properties. Here, we propose MatTools, a benchmark application that evaluates
the proficiency of LLMs in answering materials science questions through the
generation and safe execution of code based on such physics-based
computational materials science packages. MatTools is built on two
complementary components: a materials simulation tool question-answer (QA)
benchmark and a real-world tool-usage benchmark. We designed an automated
methodology to efficiently collect real-world materials science tool-use
examples. The QA benchmark, derived from the pymatgen (Python Materials
Genomics) codebase and documentation, comprises 69,225 QA pairs that assess the
ability of an LLM to understand materials science tools. The real-world
benchmark contains 49 tasks (138 subtasks) requiring the generation of
functional Python code for materials property calculations. Our evaluation of
diverse LLMs yields three key insights: (1) Generalists outshine
specialists; (2) AI knows AI; and (3) Simpler is better. MatTools provides a
standardized framework for assessing and improving LLM capabilities for
materials science tool applications, facilitating the development of more
effective AI systems for materials science and general scientific research.
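
The abstract refers to the "generation and safe execution" of model-written
code. As a rough illustration only, and not the authors' harness, the sketch
below runs a generated snippet in a separate Python process with a timeout.
The helper name `run_generated_code`, its return format, and the 10-second
default are assumptions made for this example. Process isolation with a
timeout is a much weaker guarantee than a real sandbox such as a container.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_generated_code(code: str, timeout_s: float = 10.0) -> dict:
    """Run a model-generated Python snippet in a separate interpreter.

    Hypothetical helper for illustration; returns a dict with the keys
    'ok', 'stdout', and 'stderr'.
    """
    # A throwaway working directory keeps files written by the snippet
    # out of the caller's directory and is deleted afterwards.
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                # -I: isolated mode (ignores PYTHON* environment variables
                # and the user site-packages directory).
                [sys.executable, "-I", str(script)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # The child process is killed once the timeout expires.
            return {"ok": False, "stdout": "", "stderr": "timeout"}
    return {
        "ok": proc.returncode == 0,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


if __name__ == "__main__":
    # Stand-in for an LLM answer; a tool-usage task would import pymatgen here.
    result = run_generated_code("a = 4.05\nprint(a ** 3)")
    print(result["ok"], result["stdout"].strip())
```

A grader built on this idea would then compare the captured output, or a
returned value, against a reference answer for each subtask.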