Tx-LLM: A Large Language Model for Therapeutics

Abstract

Developing therapeutics is a lengthy and expensive process that must satisfy many different criteria, and AI models capable of expediting the process would be invaluable. However, the majority of current AI approaches address only a narrowly defined set of tasks, often circumscribed within a particular domain. To bridge this gap, we introduce Tx-LLM, a generalist large language model (LLM), fine-tuned from PaLM-2, that encodes knowledge about diverse therapeutic modalities. Tx-LLM is trained using a collection of 709 datasets that target 66 tasks spanning various stages of the drug discovery pipeline. Using a single set of weights, Tx-LLM simultaneously processes a wide variety of chemical or biological entities (small molecules, proteins, nucleic acids, cell lines, diseases) interleaved with free text, allowing it to predict a broad range of associated properties. It achieves performance competitive with the state of the art (SOTA) on 43 of the 66 tasks and exceeds SOTA on 22. Among these, Tx-LLM is particularly powerful on tasks that combine molecular SMILES representations with text such as cell line names or disease names, where it exceeds best-in-class performance on average, likely due to context learned during pretraining. We observe evidence of positive transfer between tasks with diverse drug types (e.g., tasks involving small molecules and tasks involving proteins), and we study the impact of model size, domain fine-tuning, and prompting strategies on performance. We believe Tx-LLM represents an important step towards LLMs encoding biochemical knowledge and could have a future role as an end-to-end tool across the drug discovery and development pipeline.

