

Design Proteins Using Large Language Models: Enhancements and Comparative Analyses

August 12, 2024
Authors: Kamyar Zeinalipour, Neda Jamshidi, Monica Bianchini, Marco Maggini, Marco Gori
cs.AI

Abstract

Pre-trained large language models (LLMs) have demonstrated substantial capabilities across a range of conventional natural language processing (NLP) tasks, such as summarization and entity recognition. In this paper, we explore the application of LLMs to the generation of high-quality protein sequences. Specifically, we adopt a suite of pre-trained LLMs, including Mistral-7B, Llama-2-7B, Llama-3-8B, and gemma-7B, to produce valid protein sequences. All of these models are publicly available. Unlike previous work in this field, our approach utilizes a relatively small dataset comprising 42,000 distinct human protein sequences. We retrain these models to process protein-related data, ensuring the generation of biologically feasible protein structures. Our findings demonstrate that, even with limited data, the adapted models exhibit efficiency comparable to established protein-focused models such as the ProGen varieties, ProtGPT2, and ProLLaMA, which were trained on millions of protein sequences. To validate and quantify the performance of our models, we conduct comparative analyses employing standard metrics such as pLDDT, RMSD, TM-score, and REU. Furthermore, we commit to making the trained versions of all four models publicly available, fostering greater transparency and collaboration in the field of computational biology.
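Among the evaluation metrics the abstract lists, RMSD (root-mean-square deviation after optimal superposition) is the most self-contained to illustrate. The sketch below is not from the paper; it is a minimal NumPy implementation of the standard Kabsch algorithm for computing RMSD between two sets of atomic coordinates, assuming both structures have the same number of corresponding atoms.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate arrays after
    optimal translation and rotation (Kabsch algorithm)."""
    # Center both point sets on their centroids.
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # Covariance matrix and its SVD.
    H = P.T @ Q
    U, S, Vt = np.linalg.svd(H)
    # Correct for a possible reflection to keep a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T
    # Rotate P onto Q and compute the RMSD.
    P_rot = P @ R.T
    return float(np.sqrt(np.mean(np.sum((P_rot - Q) ** 2, axis=1))))
```

In practice the coordinates would come from predicted and reference PDB structures (e.g., C-alpha atoms only); tools such as TM-align compute the related TM-score, which normalizes for protein length.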

