

INDUS: Effective and Efficient Language Models for Scientific Applications

May 17, 2024
Authors: Bishwaranjan Bhattacharjee, Aashka Trivedi, Masayasu Muraoka, Muthukumaran Ramasubramanian, Takuma Udagawa, Iksha Gurung, Rong Zhang, Bharath Dandala, Rahul Ramachandran, Manil Maskey, Kayleen Bugbee, Mike Little, Elizabeth Fancher, Lauren Sanders, Sylvain Costes, Sergi Blanco-Cuaresma, Kelly Lockhart, Thomas Allen, Felix Grezes, Megan Ansdell, Alberto Accomazzi, Yousef El-Kurdi, Davis Wertheimer, Birgit Pfitzmann, Cesar Berrospi Ramis, Michele Dolfi, Rafael Teixeira de Lima, Panos Vagenas, S. Karthik Mukkavilli, Peter Staar, Sanaz Vahidinia, Ryan McGranaghan, Armin Mehrabian, Tsendgar Lee
cs.AI

Abstract

Large language models (LLMs) trained on general-domain corpora have shown remarkable results on natural language processing (NLP) tasks. However, previous research has demonstrated that LLMs trained on domain-focused corpora perform better on specialized tasks. Inspired by this pivotal insight, we developed INDUS, a comprehensive suite of LLMs tailored for the Earth science, biology, physics, heliophysics, planetary sciences and astrophysics domains and trained on curated scientific corpora drawn from diverse data sources. The suite includes: (1) an encoder model trained with domain-specific vocabulary and corpora to address natural language understanding tasks, (2) a contrastive-learning-based general text embedding model trained on a diverse set of datasets drawn from multiple sources to address information retrieval tasks, and (3) smaller versions of these models created with knowledge distillation techniques to address applications with latency or resource constraints. We also created three new scientific benchmark datasets, namely CLIMATE-CHANGE-NER (entity recognition), NASA-QA (extractive QA) and NASA-IR (information retrieval), to accelerate research in these multi-disciplinary fields. Finally, we show that our models outperform both general-purpose encoders (RoBERTa) and existing domain-specific encoders (SciBERT) on these new tasks as well as on existing benchmark tasks in the domains of interest.
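
The abstract names two training techniques behind the suite: contrastive learning for the text embedding model and knowledge distillation for the smaller models. The sketch below illustrates both in generic PyTorch. It is a minimal illustration of the standard formulations (an in-batch-negatives InfoNCE loss and a temperature-softened KL distillation loss), not the paper's actual training code; all function names, dimensions and hyperparameter values are illustrative assumptions.

    # Hypothetical sketch, assuming standard contrastive and distillation
    # objectives; not taken from the INDUS paper's implementation.
    import torch
    import torch.nn.functional as F

    def info_nce_loss(query_emb, pos_emb, temperature=0.05):
        """In-batch-negatives contrastive loss over normalized embeddings."""
        q = F.normalize(query_emb, dim=-1)
        p = F.normalize(pos_emb, dim=-1)
        logits = q @ p.T / temperature       # (B, B) cosine similarity matrix
        labels = torch.arange(q.size(0))     # matched pairs lie on the diagonal
        return F.cross_entropy(logits, labels)

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        """KL divergence between temperature-softened teacher and student outputs."""
        t = temperature
        soft_teacher = F.softmax(teacher_logits / t, dim=-1)
        log_student = F.log_softmax(student_logits / t, dim=-1)
        return F.kl_div(log_student, soft_teacher, reduction="batchmean") * t * t

    # Toy usage: random tensors stand in for encoder outputs.
    q, p = torch.randn(8, 768), torch.randn(8, 768)
    print(info_nce_loss(q, p).item())
    print(distillation_loss(torch.randn(8, 100), torch.randn(8, 100)).item())

In practice the embedding model would encode query/passage pairs with the INDUS encoder, and the student in distillation would be the smaller encoder variant; the random tensors above merely stand in for those model outputs.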
