BioMamba: A Pre-trained Biomedical Language Representation Model Leveraging Mamba

Abstract

The advancement of natural language processing (NLP) in biology hinges on models' ability to interpret intricate biomedical literature. Traditional models often struggle with the complex, domain-specific language of this field. In this paper, we present BioMamba, a pre-trained model designed specifically for biomedical text mining. BioMamba builds on the Mamba architecture and is pre-trained on an extensive corpus of biomedical literature. Our empirical studies demonstrate that BioMamba significantly outperforms models such as BioBERT and general-domain Mamba across various biomedical tasks. For instance, BioMamba achieves a 100-fold reduction in perplexity and a 4-fold reduction in cross-entropy loss on the BioASQ test set. We provide an overview of the model architecture, pre-training process, and fine-tuning techniques. Additionally, we release the code and trained model to facilitate further research.
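
The two metrics quoted above are closely linked: when cross-entropy is taken as the mean per-token negative log-likelihood in nats, perplexity is its exponential. The sketch below illustrates that relationship. It assumes this standard definition (the abstract does not state which units or averaging the authors use), the function names are illustrative, and the token probabilities are invented for the example and are not BioMamba results.

```python
import math


def cross_entropy(token_probs):
    """Mean negative log-likelihood (in nats) of the probabilities
    a model assigned to the correct tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs):
    """Perplexity is the exponential of the mean cross-entropy."""
    return math.exp(cross_entropy(token_probs))


# Hypothetical probabilities, invented for illustration only.
weak_model = [0.001, 0.002, 0.0005, 0.001]
strong_model = [0.1, 0.2, 0.05, 0.1]

for name, probs in [("weak", weak_model), ("strong", strong_model)]:
    print(f"{name}: cross-entropy={cross_entropy(probs):.3f} nats, "
          f"perplexity={perplexity(probs):.1f}")

# A k-fold drop in perplexity equals a drop of ln(k) nats in cross-entropy.
ratio = perplexity(weak_model) / perplexity(strong_model)
gap = cross_entropy(weak_model) - cross_entropy(strong_model)
print(f"perplexity ratio={ratio:.1f}, cross-entropy gap={gap:.3f} nats")
```

Under this definition, a 100-fold perplexity reduction corresponds to a cross-entropy drop of ln(100), roughly 4.6 nats. How that absolute drop translates into a ratio of losses depends on the baseline loss.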