Are large language models superhuman chemists?
April 1, 2024
Authors: Adrian Mirza, Nawaf Alampara, Sreekanth Kunchapu, Benedict Emoekabu, Aswanth Krishnan, Mara Wilhelmi, Macjonathan Okereke, Juliane Eberhardt, Amir Mohammad Elahi, Maximilian Greiner, Caroline T. Holick, Tanya Gupta, Mehrdad Asgari, Christina Glaubitz, Lea C. Klepsch, Yannik Köster, Jakob Meyer, Santiago Miret, Tim Hoffmann, Fabian Alexander Kreth, Michael Ringleb, Nicole Roesner, Ulrich S. Schubert, Leanne M. Stafast, Dinga Wonanke, Michael Pieler, Philippe Schwaller, Kevin Maik Jablonka
cs.AI
Abstract
Large language models (LLMs) have gained widespread interest due to their
ability to process human language and perform tasks on which they have not been
explicitly trained. This is relevant for the chemical sciences, which face the
problem of small and diverse datasets that are frequently in the form of text.
LLMs have shown promise in addressing these issues and are increasingly being
harnessed to predict chemical properties, optimize reactions, and even design
and conduct experiments autonomously. However, we still have only a very
limited systematic understanding of the chemical reasoning capabilities of
LLMs, which would be required to improve models and mitigate potential harms.
Here, we introduce "ChemBench," an automated framework designed to rigorously
evaluate the chemical knowledge and reasoning abilities of state-of-the-art
LLMs against the expertise of human chemists. We curated more than 7,000
question-answer pairs for a wide array of subfields of the chemical sciences,
evaluated leading open and closed-source LLMs, and found that the best models
outperformed the best human chemists in our study on average. The models,
however, struggle with some chemical reasoning tasks that are easy for human
experts and provide overconfident, misleading predictions, such as about
chemicals' safety profiles. These findings underscore the dual reality that,
although LLMs demonstrate remarkable proficiency in chemical tasks, further
research is critical to enhancing their safety and utility in chemical
sciences. Our findings also indicate a need for adaptations to chemistry
curricula and highlight the importance of continuing to develop evaluation
frameworks to improve safe and useful LLMs.
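To illustrate the kind of evaluation loop a framework like ChemBench automates, the sketch below scores a model on curated multiple-choice question-answer pairs. The `QAPair` structure, the `evaluate` function, and the scoring rule are illustrative assumptions for this example, not the actual ChemBench API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class QAPair:
    question: str       # chemistry question text (hypothetical format)
    choices: list[str]  # candidate answers for a multiple-choice item
    answer: int         # index of the correct choice

def evaluate(pairs: list[QAPair], ask_model: Callable[[str, list[str]], int]) -> float:
    """Score any model callable that returns the index of its chosen answer;
    returns the fraction of question-answer pairs answered correctly."""
    correct = sum(ask_model(p.question, p.choices) == p.answer for p in pairs)
    return correct / len(pairs) if pairs else 0.0

if __name__ == "__main__":
    # Trivial "model" that always picks the first choice, for demonstration only.
    pairs = [QAPair("Which element has the symbol Na?", ["Sodium", "Nitrogen"], 0)]
    print(evaluate(pairs, lambda question, choices: 0))  # -> 1.0
```

In a real benchmark run, the callable would wrap an LLM API call and parse its response into a choice index; aggregate scores per subfield could then be compared against the human expert baselines described above.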