What indeed can GPT models do in chemistry? A comprehensive benchmark on eight tasks
May 27, 2023
Authors: Taicheng Guo, Kehan Guo, Bozhao Nan, Zhengwen Liang, Zhichun Guo, Nitesh V. Chawla, Olaf Wiest, Xiangliang Zhang
cs.AI
Abstract
Large Language Models (LLMs) with strong abilities in natural language processing tasks have emerged and have been rapidly applied in various areas such as science, finance, and software engineering. However, the capability of LLMs to advance the field of chemistry remains unclear. In this paper, we establish a comprehensive benchmark containing 8 practical chemistry tasks, including 1) name prediction, 2) property prediction, 3) yield prediction, 4) reaction prediction, 5) retrosynthesis (prediction of reactants from products), 6) text-based molecule design, 7) molecule captioning, and 8) reagent selection. Our analysis draws on widely recognized datasets including BBBP, Tox21, PubChem, USPTO, and ChEBI, facilitating a broad exploration of the capacities of LLMs within the context of practical chemistry. Three GPT models (GPT-4, GPT-3.5, and Davinci-003) are evaluated for each chemistry task in zero-shot and few-shot in-context learning settings with carefully selected demonstration examples and specially crafted prompts. The key results of our investigation are 1) GPT-4 outperforms the other two models among the three evaluated; 2) GPT models exhibit less competitive performance in tasks demanding a precise understanding of molecular SMILES representations, such as reaction prediction and retrosynthesis; 3) GPT models demonstrate strong capabilities in text-related explanation tasks such as molecule captioning; and 4) GPT models exhibit performance comparable to or better than classical machine learning models when applied to chemical problems that can be transformed into classification or ranking tasks, such as property prediction and yield prediction.
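
The abstract describes evaluating GPT models in zero-shot and few-shot in-context learning settings with selected demonstration examples and crafted prompts. The sketch below is a minimal, hypothetical illustration of how such a few-shot prompt for the property prediction task (a BBBP-style yes/no classification from SMILES) could be assembled and sent to a GPT model via the OpenAI chat API; the prompt wording, the demonstration molecules and labels, and the `build_prompt`/`predict` helpers are illustrative assumptions, not the paper's actual prompts, data, or code.

```python
# Hypothetical sketch of few-shot in-context learning for SMILES-based
# property prediction; demonstrations and labels are for illustration only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative (SMILES, label) demonstration pairs, not taken from BBBP.
DEMONSTRATIONS = [
    ("CCO", "Yes"),          # ethanol
    ("c1ccccc1", "Yes"),     # benzene
    ("CC(=O)O", "No"),       # acetic acid (label chosen only as an example)
]

def build_prompt(query_smiles: str) -> str:
    """Assemble a few-shot prompt: task description, demonstrations, then the query."""
    lines = [
        "You are an expert chemist. Given the SMILES string of a molecule,",
        "answer whether it can penetrate the blood-brain barrier.",
        "Answer with only Yes or No.",
        "",
    ]
    for smiles, label in DEMONSTRATIONS:
        lines.append(f"SMILES: {smiles}\nAnswer: {label}")
    lines.append(f"SMILES: {query_smiles}\nAnswer:")
    return "\n".join(lines)

def predict(query_smiles: str, model: str = "gpt-4") -> str:
    """Send the few-shot prompt to a GPT model and return its raw answer."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(query_smiles)}],
        temperature=0,  # deterministic decoding for benchmarking
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    # Example query: paracetamol.
    print(predict("CC(=O)Nc1ccc(O)cc1"))
```

Setting the number of demonstrations to zero recovers the zero-shot setting; varying the model name switches among the evaluated GPT variants.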