What indeed can GPT models do in chemistry? A comprehensive benchmark on eight tasks

Abstract

Large Language Models (LLMs) with strong abilities in natural language processing tasks have emerged and have been rapidly applied in a wide range of areas such as science, finance, and software engineering. However, the capability of LLMs to advance the field of chemistry remains unclear. In this paper, we establish a comprehensive benchmark containing 8 practical chemistry tasks: 1) name prediction, 2) property prediction, 3) yield prediction, 4) reaction prediction, 5) retrosynthesis (prediction of reactants from products), 6) text-based molecule design, 7) molecule captioning, and 8) reagent selection. Our analysis draws on widely recognized datasets including BBBP, Tox21, PubChem, USPTO, and ChEBI, facilitating a broad exploration of the capacities of LLMs within the context of practical chemistry. Three GPT models (GPT-4, GPT-3.5, and Davinci-003) are evaluated on each chemistry task in zero-shot and few-shot in-context learning settings with carefully selected demonstration examples and specially crafted prompts. The key results of our investigation are: 1) GPT-4 outperforms the other two models among the three evaluated; 2) GPT models exhibit less competitive performance in tasks demanding precise understanding of molecular SMILES representation, such as reaction prediction and retrosynthesis; 3) GPT models demonstrate strong capabilities in text-related explanation tasks such as molecule captioning; and 4) GPT models exhibit performance comparable to or better than classical machine learning models when applied to chemical problems that can be transformed into classification or ranking tasks, such as property prediction and yield prediction.
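To make the zero-shot and few-shot in-context learning settings concrete, the following is a minimal sketch of how such prompts might be assembled for a property prediction task. It is not the prompt template used in the benchmark: the function name `build_prompt`, the instruction wording, and the demonstration labels are all assumptions made for illustration, and the labels are placeholders rather than values taken from BBBP or any other dataset.

```python
from typing import List, Tuple


def build_prompt(instruction: str,
                 demos: List[Tuple[str, str]],
                 query: str) -> str:
    """Assemble an in-context learning prompt (hypothetical format).

    An empty `demos` list gives a zero-shot prompt; k demonstrations
    give a k-shot prompt.
    """
    parts = [instruction]
    for smiles, label in demos:
        parts.append(f"SMILES: {smiles}\nAnswer: {label}")
    # The query repeats the demonstration format with the answer left blank.
    parts.append(f"SMILES: {query}\nAnswer:")
    return "\n\n".join(parts)


# Assumed instruction text; labels below are placeholders, not dataset values.
instruction = ("Decide whether the molecule can cross the blood-brain "
               "barrier. Answer Yes or No.")
demos = [("CCO", "Yes"), ("CC(=O)O", "No")]

zero_shot = build_prompt(instruction, [], "c1ccccc1")
few_shot = build_prompt(instruction, demos, "c1ccccc1")
```

The only difference between the two settings in this sketch is whether demonstration pairs are placed before the query; how the demonstrations are selected is a separate choice that the sketch does not address.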
