Are large language models superhuman chemists?

Abstract

Large language models (LLMs) have gained widespread interest due to their ability to process human language and perform tasks on which they have not been explicitly trained. This is relevant for the chemical sciences, which face the problem of small and diverse datasets that are frequently in the form of text. LLMs have shown promise in addressing these issues and are increasingly being harnessed to predict chemical properties, optimize reactions, and even design and conduct experiments autonomously. However, we still have only a very limited systematic understanding of the chemical reasoning capabilities of LLMs, which would be required to improve models and mitigate potential harms. Here, we introduce "ChemBench," an automated framework designed to rigorously evaluate the chemical knowledge and reasoning abilities of state-of-the-art LLMs against the expertise of human chemists. We curated more than 7,000 question-answer pairs for a wide array of subfields of the chemical sciences, evaluated leading open- and closed-source LLMs, and found that the best models outperformed the best human chemists in our study on average. The models, however, struggle with some chemical reasoning tasks that are easy for human experts and provide overconfident, misleading predictions, for example about the safety profiles of chemicals. These findings underscore the dual reality that, although LLMs demonstrate remarkable proficiency in chemical tasks, further research is critical to enhancing their safety and utility in the chemical sciences. Our findings also indicate a need for adaptations to chemistry curricula and highlight the importance of continuing to develop evaluation frameworks to make LLMs safer and more useful.
