Design Proteins Using Large Language Models: Enhancements and Comparative Analyses
August 12, 2024
Authors: Kamyar Zeinalipour, Neda Jamshidi, Monica Bianchini, Marco Maggini, Marco Gori
cs.AI
Abstract
Pre-trained large language models (LLMs) have demonstrated substantial capabilities across a range of
conventional natural language processing (NLP) tasks, such as summarization and
entity recognition. In this paper, we explore the application of LLMs in the
generation of high-quality protein sequences. Specifically, we adopt a suite of
pre-trained LLMs, including Mistral-7B, Llama-2-7B, Llama-3-8B, and
gemma-7B, to produce valid protein sequences. All of these models are publicly
available. Unlike previous work in this field, our approach utilizes a
relatively small dataset comprising 42,000 distinct human protein sequences. We
retrain these models to process protein-related data, ensuring the generation
of biologically feasible protein structures. Our findings demonstrate that even
with limited data, the adapted models perform comparably to established
protein-focused models such as the ProGen variants, ProtGPT2, and
ProLLaMA, which were trained on millions of protein sequences. To validate and
quantify the performance of our models, we conduct comparative analyses
employing standard metrics such as pLDDT, RMSD, TM-score, and REU. Furthermore,
we commit to making the trained versions of all four models publicly available,
fostering greater transparency and collaboration in the field of computational
biology.
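The abstract describes retraining publicly available base LLMs on a corpus of 42,000 human protein sequences. As a rough illustration of that kind of setup, and not the authors' released training code, a minimal causal-LM fine-tuning loop with Hugging Face transformers might look like the sketch below. The data file name, the prompt, all hyperparameters, and the reuse of the base tokenizer on raw amino-acid strings are illustrative assumptions.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "mistralai/Mistral-7B-v0.1"  # one of the four base models named above
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Mistral/Llama tokenizers ship no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# One amino-acid sequence per line, e.g. "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ..."
# (hypothetical file; the paper's dataset comprises 42,000 human protein sequences)
dataset = load_dataset("text", data_files={"train": "human_proteins.txt"})

def tokenize(batch):
    # Assumption: the natural-language tokenizer is applied directly to the
    # amino-acid strings; the paper may use a different encoding scheme.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="protein-lm",
        per_device_train_batch_size=4,
        num_train_epochs=3,
        learning_rate=2e-5,
        bf16=True,
    ),
    train_dataset=tokenized,
    # mlm=False gives standard next-token (causal) language modeling
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Sample a new candidate sequence from the adapted model.
prompt = tokenizer("M", return_tensors="pt").input_ids.to(model.device)
out = model.generate(prompt, max_new_tokens=256, do_sample=True, top_p=0.95)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

At 7B parameters, a full fine-tune of this kind would normally be sharded across GPUs or replaced with a parameter-efficient method such as LoRA; the sketch omits those engineering details.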
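Among the evaluation metrics named above, pLDDT is reported by structure predictors (AlphaFold/ESMFold), TM-score by TM-align, and REU by Rosetta, so those three depend on external tools. RMSD is simple enough to show directly: below is a minimal NumPy sketch of Cα RMSD after optimal (Kabsch) superposition, with random toy coordinates standing in for real predicted structures.

```python
import numpy as np

def kabsch_rmsd(P: np.ndarray, Q: np.ndarray) -> float:
    """RMSD between two (N, 3) coordinate sets after optimal superposition."""
    P = P - P.mean(axis=0)                   # center both structures
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                              # 3x3 covariance matrix
    U, S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against an improper rotation
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T  # optimal rotation matrix
    P_rot = P @ R.T
    return float(np.sqrt(np.mean(np.sum((P_rot - Q) ** 2, axis=1))))

# Toy check: a rigidly rotated copy of a structure should give RMSD ~ 0.
rng = np.random.default_rng(0)
coords = rng.normal(size=(100, 3))
theta = np.pi / 4
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])
print(kabsch_rmsd(coords, coords @ rot.T))   # ~0.0
```

In practice the (N, 3) arrays would hold Cα coordinates parsed from the predicted and reference PDB files rather than random points.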