Tx-LLM: A Large Language Model for Therapeutics
June 10, 2024
Authors: Juan Manuel Zambrano Chaves, Eric Wang, Tao Tu, Eeshit Dhaval Vaishnav, Byron Lee, S. Sara Mahdavi, Christopher Semturs, David Fleet, Vivek Natarajan, Shekoofeh Azizi
cs.AI
Abstract
Developing therapeutics is a lengthy and expensive process that requires the
satisfaction of many different criteria, and AI models capable of expediting
the process would be invaluable. However, the majority of current AI approaches
address only a narrowly defined set of tasks, often circumscribed within a
particular domain. To bridge this gap, we introduce Tx-LLM, a generalist large
language model (LLM) fine-tuned from PaLM-2 which encodes knowledge about
diverse therapeutic modalities. Tx-LLM is trained using a collection of 709
datasets that target 66 tasks spanning various stages of the drug discovery
pipeline. Using a single set of weights, Tx-LLM simultaneously processes a wide
variety of chemical or biological entities (small molecules, proteins, nucleic
acids, cell lines, diseases) interleaved with free-text, allowing it to predict
a broad range of associated properties, achieving performance competitive with
state-of-the-art (SOTA) on 43 out of 66 tasks and exceeding SOTA on
22. Among these, Tx-LLM is particularly powerful and exceeds best-in-class
performance on average for tasks combining molecular SMILES representations
with text such as cell line names or disease names, likely due to context
learned during pretraining. We observe evidence of positive transfer between
tasks with diverse drug types (e.g., tasks involving small molecules and tasks
involving proteins), and we study the impact of model size, domain finetuning,
and prompting strategies on performance. We believe Tx-LLM represents an
important step towards LLMs encoding biochemical knowledge and could have a
future role as an end-to-end tool across the drug discovery development
pipeline.
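To make the idea of interleaving entity representations with free text more concrete, here is a minimal sketch of how a molecular SMILES string and a cell line name might be composed into a single textual prompt. The template wording, function name, and example values below are hypothetical illustrations, not the actual prompt format used by Tx-LLM:

```python
# Hypothetical prompt construction interleaving a SMILES string with free
# text (here, a cell line name), in the spirit of the multi-entity prompting
# the abstract describes. The template and examples are illustrative only.

def build_prompt(smiles: str, cell_line: str) -> str:
    """Compose one free-text prompt containing a molecular SMILES
    representation and a cell line name."""
    return (
        "Instructions: Answer the following question about drug properties.\n"
        f"Question: Is the drug with SMILES {smiles} active against "
        f"the cell line {cell_line}? Answer Yes or No.\n"
        "Answer:"
    )

# Aspirin's SMILES and the MCF7 breast cancer cell line as example inputs.
prompt = build_prompt("CC(=O)Oc1ccccc1C(=O)O", "MCF7")
print(prompt)
```

Because the entities are serialized as plain text, a single set of model weights can handle small molecules, proteins, or diseases simply by varying what is interpolated into the prompt.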