Tx-LLM: A Large Language Model for Therapeutics
June 10, 2024
Authors: Juan Manuel Zambrano Chaves, Eric Wang, Tao Tu, Eeshit Dhaval Vaishnav, Byron Lee, S. Sara Mahdavi, Christopher Semturs, David Fleet, Vivek Natarajan, Shekoofeh Azizi
cs.AI
Abstract
Developing therapeutics is a lengthy and expensive process that requires the
satisfaction of many different criteria, and AI models capable of expediting
the process would be invaluable. However, the majority of current AI approaches
address only a narrowly defined set of tasks, often circumscribed within a
particular domain. To bridge this gap, we introduce Tx-LLM, a generalist large
language model (LLM) fine-tuned from PaLM-2 which encodes knowledge about
diverse therapeutic modalities. Tx-LLM is trained using a collection of 709
datasets that target 66 tasks spanning various stages of the drug discovery
pipeline. Using a single set of weights, Tx-LLM simultaneously processes a wide
variety of chemical or biological entities (small molecules, proteins, nucleic
acids, cell lines, diseases) interleaved with free text, allowing it to predict
a broad range of associated properties. It achieves performance competitive
with state-of-the-art (SOTA) on 43 out of 66 tasks and exceeds SOTA on 22.
Among these, Tx-LLM is particularly powerful, exceeding best-in-class
performance on average for tasks combining molecular SMILES representations
with text such as cell line names or disease names, likely due to context
learned during pretraining. We observe evidence of positive transfer between
tasks with diverse drug types (e.g., tasks involving small molecules and tasks
involving proteins), and we study the impact of model size, domain finetuning,
and prompting strategies on performance. We believe Tx-LLM represents an
important step towards LLMs that encode biochemical knowledge and could play a
future role as an end-to-end tool across the drug discovery and development
pipeline.
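The abstract describes an instruction-tuned model that mixes entity representations such as molecular SMILES strings with free text in a single prompt. As a rough illustration of what such a prompt might look like, the minimal sketch below assembles a text prompt for a binary small-molecule property task (blood-brain barrier penetration, one of the task types covered by therapeutics benchmarks). The `build_prompt` helper, the field labels, and the few-shot layout are assumptions for illustration, not Tx-LLM's actual prompt format.

```python
# Hypothetical sketch of assembling a prompt that interleaves a molecular
# SMILES string with free text, in the spirit of the instruction-tuning
# setup the abstract describes. The helper name, the field labels
# ("Drug SMILES:", "Answer:"), and the few-shot layout are assumptions
# for illustration only.

def build_prompt(instruction: str, query_smiles: str,
                 few_shot: tuple[tuple[str, str], ...] = ()) -> str:
    """Combine a task instruction, optional few-shot (SMILES, label)
    examples, and a query molecule into one free-text prompt."""
    lines = [f"Instructions: {instruction}"]
    for example_smiles, label in few_shot:
        lines.append(f"Drug SMILES: {example_smiles}")
        lines.append(f"Answer: {label}")
    lines.append(f"Drug SMILES: {query_smiles}")
    lines.append("Answer:")  # the model completes this final field
    return "\n".join(lines)


if __name__ == "__main__":
    prompt = build_prompt(
        instruction=("Given a drug SMILES string, answer Yes if the molecule "
                     "is predicted to cross the blood-brain barrier, "
                     "otherwise answer No."),
        query_smiles="CC(=O)Oc1ccccc1C(=O)O",  # aspirin, as an example input
        few_shot=(("CCO", "Yes"),),  # ethanol; shown only to fix the format
    )
    print(prompt)
```

Under this kind of setup, a single set of weights can serve many tasks simply by swapping the instruction text and entity strings, which is consistent with the abstract's claim that diverse entities are handled "interleaved with free text" and with its study of prompting strategies.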