

Are large language models superhuman chemists?

April 1, 2024
Authors: Adrian Mirza, Nawaf Alampara, Sreekanth Kunchapu, Benedict Emoekabu, Aswanth Krishnan, Mara Wilhelmi, Macjonathan Okereke, Juliane Eberhardt, Amir Mohammad Elahi, Maximilian Greiner, Caroline T. Holick, Tanya Gupta, Mehrdad Asgari, Christina Glaubitz, Lea C. Klepsch, Yannik Köster, Jakob Meyer, Santiago Miret, Tim Hoffmann, Fabian Alexander Kreth, Michael Ringleb, Nicole Roesner, Ulrich S. Schubert, Leanne M. Stafast, Dinga Wonanke, Michael Pieler, Philippe Schwaller, Kevin Maik Jablonka
cs.AI

Abstract

Large language models (LLMs) have gained widespread interest due to their ability to process human language and perform tasks on which they have not been explicitly trained. This is relevant for the chemical sciences, which face the problem of small and diverse datasets that are frequently in the form of text. LLMs have shown promise in addressing these issues and are increasingly being harnessed to predict chemical properties, optimize reactions, and even design and conduct experiments autonomously. However, we still have only a very limited systematic understanding of the chemical reasoning capabilities of LLMs, which would be required to improve models and mitigate potential harms. Here, we introduce "ChemBench," an automated framework designed to rigorously evaluate the chemical knowledge and reasoning abilities of state-of-the-art LLMs against the expertise of human chemists. We curated more than 7,000 question-answer pairs for a wide array of subfields of the chemical sciences, evaluated leading open and closed-source LLMs, and found that the best models outperformed the best human chemists in our study on average. The models, however, struggle with some chemical reasoning tasks that are easy for human experts and provide overconfident, misleading predictions, such as about chemicals' safety profiles. These findings underscore the dual reality that, although LLMs demonstrate remarkable proficiency in chemical tasks, further research is critical to enhancing their safety and utility in chemical sciences. Our findings also indicate a need for adaptations to chemistry curricula and highlight the importance of continuing to develop evaluation frameworks to improve safe and useful LLMs.
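
The abstract describes an automated framework that scores LLMs against curated question-answer pairs. As a rough illustration only, the Python sketch below shows what such an evaluation loop could look like; every name in it (query_model, evaluate, the dictionary fields) is a hypothetical placeholder and does not reflect the actual ChemBench API.

# Minimal sketch of an automated question-answer evaluation loop of the kind
# the abstract describes. All names here are hypothetical placeholders, not
# the real ChemBench interface.

def query_model(prompt: str) -> str:
    """Placeholder for a call to an LLM (e.g. via an API client)."""
    raise NotImplementedError

def evaluate(questions: list[dict]) -> float:
    """Score a model on multiple-choice question-answer pairs.

    Each item is assumed to look like:
        {"question": "...", "choices": ["A) ...", "B) ..."], "answer": "A"}
    """
    correct = 0
    for item in questions:
        prompt = item["question"] + "\n" + "\n".join(item["choices"])
        reply = query_model(prompt).strip().upper()
        # Naive parsing: treat the first character of the reply as the chosen option letter.
        if reply[:1] == item["answer"]:
            correct += 1
    return correct / len(questions)

if __name__ == "__main__":
    sample = [
        {
            "question": "Which element has the symbol Na?",
            "choices": ["A) Sodium", "B) Nitrogen", "C) Neon"],
            "answer": "A",
        }
    ]
    # accuracy = evaluate(sample)  # requires a real query_model implementation

In practice, a benchmark of this kind also needs robust answer parsing (models rarely reply with a bare option letter) and per-topic aggregation so that results can be compared with human experts subfield by subfield, as the paper does.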
