BioMamba: A Pre-trained Biomedical Language Representation Model Leveraging Mamba

Abstract

Progress in natural language processing (NLP) for biology depends on models' ability to interpret intricate biomedical literature. Traditional models often struggle with the complex, domain-specific language of this field. In this paper, we present BioMamba, a pre-trained model designed specifically for biomedical text mining. BioMamba builds on the Mamba architecture and is pre-trained on an extensive corpus of biomedical literature. Our empirical studies show that BioMamba significantly outperforms models such as BioBERT and general-domain Mamba across a range of biomedical tasks. For instance, BioMamba achieves a 100-fold reduction in perplexity and a 4-fold reduction in cross-entropy loss on the BioASQ test set. We provide an overview of the model architecture, the pre-training process, and the fine-tuning techniques. We also release the code and the trained model to facilitate further research.
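The two metrics reported in the abstract are closely linked. Under the standard definition, perplexity is the exponential of the mean per-token cross-entropy, so a difference in cross-entropy corresponds to a ratio in perplexity. The following is a minimal sketch of that relationship, assuming cross-entropy is measured in nats; the helper names and the toy probabilities are illustrative assumptions, not code or results from the paper.

```python
import math


def cross_entropy(token_probs):
    """Mean negative log-likelihood (in nats) of the observed tokens.

    token_probs holds the probability a model assigned to each
    token that actually occurred in the text.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs):
    """Perplexity as the exponential of the mean cross-entropy."""
    return math.exp(cross_entropy(token_probs))


# Toy probabilities chosen for easy arithmetic (not data from the paper).
toy_probs = [0.5, 0.25, 0.125]

ce = cross_entropy(toy_probs)   # equals 2 * ln(2)
ppl = perplexity(toy_probs)     # equals 4

print(f"cross-entropy: {ce:.4f} nats, perplexity: {ppl:.4f}")
```

Because of the exponential, a modest drop in cross-entropy can translate into a much larger relative drop in perplexity, which is one way a 4-fold and a 100-fold reduction can be reported side by side.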
