DNAGPT: A Generalized Pretrained Tool for Multiple DNA Sequence Analysis Tasks

July 11, 2023
Authors: Daoan Zhang, Weitong Zhang, Bing He, Jianguo Zhang, Chenchen Qin, Jianhua Yao
cs.AI

Abstract

The success of the GPT series shows that GPT can extract general information from sequences, thereby benefiting all downstream tasks. This motivates us to use pre-trained models to explore the hidden information in DNA sequences. However, data and task requirements in DNA sequence analysis are complex and diverse, because DNA-related data includes different types of information, such as sequences and expression levels, and there is currently no model specifically designed for these characteristics. Here, we present DNAGPT, a generalized foundation model pre-trained on over 10 billion base pairs from 9 species, which can be fine-tuned for any DNA sequence analysis task. Our model can simultaneously process or output DNA sequences and numbers. In addition, our unique token design allows users to design prompts according to their own task requirements, making it applicable to any type of task. We have evaluated our model on classification, regression, and generation tasks. We demonstrate that DNAGPT benefits from pre-training and can therefore bring performance gains to any downstream task. Our model is not only a new attempt in the field of genome analysis, but also provides a new direction for the application of foundation models in biology.