DNAGPT: A Generalized Pretrained Tool for Multiple DNA Sequence Analysis Tasks

July 11, 2023
Authors: Daoan Zhang, Weitong Zhang, Bing He, Jianguo Zhang, Chenchen Qin, Jianhua Yao
cs.AI

Abstract

The success of the GPT series shows that GPT can extract general information from sequences, thereby benefiting all downstream tasks. This motivates us to use pre-trained models to explore the hidden information in DNA sequences. However, data and task requirements in DNA sequence analysis are complex and diverse: DNA-related data includes different types of information, such as sequences and expression levels, and there is currently no model specifically designed for these characteristics. Here we present DNAGPT, a generalized foundation model pre-trained on over 10 billion base pairs from 9 species, which can be fine-tuned for any DNA sequence analysis task. Our model can simultaneously process or output DNA sequences and numbers. In addition, our unique token design allows users to design prompts according to their own task requirements, making it applicable to any type of task. We have evaluated our model on classification, regression, and generation tasks. We demonstrate that DNAGPT benefits from pre-training and can therefore bring performance gains to any downstream task. Our model is not only a new attempt in the field of genome analysis, but also provides a new direction for the application of foundation models in biology.