Tx-LLM: A Large Language Model for Therapeutics

Abstract

Developing therapeutics is a lengthy and expensive process that requires satisfying many different criteria, and AI models capable of expediting the process would be invaluable. However, the majority of current AI approaches address only a narrowly defined set of tasks, often circumscribed within a particular domain. To bridge this gap, we introduce Tx-LLM, a generalist large language model (LLM) fine-tuned from PaLM-2 that encodes knowledge about diverse therapeutic modalities. Tx-LLM is trained using a collection of 709 datasets that target 66 tasks spanning various stages of the drug discovery pipeline. Using a single set of weights, Tx-LLM simultaneously processes a wide variety of chemical or biological entities (small molecules, proteins, nucleic acids, cell lines, diseases) interleaved with free text, allowing it to predict a broad range of associated properties. It achieves performance competitive with the state of the art (SOTA) on 43 out of 66 tasks and exceeds SOTA on 22. Among these, Tx-LLM is particularly powerful and exceeds best-in-class performance on average for tasks combining molecular SMILES representations with text such as cell line names or disease names, likely due to context learned during pretraining. We observe evidence of positive transfer between tasks with diverse drug types (e.g., tasks involving small molecules and tasks involving proteins), and we study the impact of model size, domain fine-tuning, and prompting strategies on performance. We believe Tx-LLM represents an important step towards LLMs encoding biochemical knowledge and could have a future role as an end-to-end tool across the drug discovery and development pipeline.
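To make the idea of a SMILES string "interleaved with free text" concrete, here is a minimal Python sketch of how such a prompt could be assembled. It is an illustration only: the function name `build_prompt`, the field labels ("Instructions", "Cell line", "Drug SMILES", "Question", "Answer"), and the example question are all assumptions made for this sketch, not the prompt template used to train Tx-LLM.

```python
def build_prompt(instruction: str, smiles: str, context: dict, question: str) -> str:
    """Join a SMILES string and free-text fields into one plain-text prompt.

    Hypothetical layout for illustration; the field labels are assumptions.
    """
    lines = [f"Instructions: {instruction}"]
    # Free-text context such as a cell line name or a disease name.
    for name, value in context.items():
        lines.append(f"{name}: {value}")
    # The molecule is passed as a plain SMILES string next to the text fields.
    lines.append(f"Drug SMILES: {smiles}")
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)


prompt = build_prompt(
    instruction="Answer the following question about drug sensitivity.",
    smiles="CC(=O)OC1=CC=CC=C1C(=O)O",  # aspirin, used only as a familiar example
    context={"Cell line": "MCF-7"},
    question="Is this cell line sensitive to the drug? Answer Yes or No.",
)
print(prompt)
```

The point of the sketch is that the molecule and the cell line name share a single text sequence, so one model with one set of weights can read both without a separate encoder per entity type.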
