INDUS: Effective and Efficient Language Models for Scientific Applications

May 17, 2024
Authors: Bishwaranjan Bhattacharjee, Aashka Trivedi, Masayasu Muraoka, Muthukumaran Ramasubramanian, Takuma Udagawa, Iksha Gurung, Rong Zhang, Bharath Dandala, Rahul Ramachandran, Manil Maskey, Kaylin Bugbee, Mike Little, Elizabeth Fancher, Lauren Sanders, Sylvain Costes, Sergi Blanco-Cuaresma, Kelly Lockhart, Thomas Allen, Felix Grezes, Megan Ansdell, Alberto Accomazzi, Yousef El-Kurdi, Davis Wertheimer, Birgit Pfitzmann, Cesar Berrospi Ramis, Michele Dolfi, Rafael Teixeira de Lima, Panos Vagenas, S. Karthik Mukkavilli, Peter Staar, Sanaz Vahidinia, Ryan McGranaghan, Armin Mehrabian, Tsengdar Lee
cs.AI

Abstract

Large language models (LLMs) trained on general-domain corpora have shown remarkable results on natural language processing (NLP) tasks. However, previous research demonstrated that LLMs trained on domain-focused corpora perform better on specialized tasks. Inspired by this pivotal insight, we developed INDUS, a comprehensive suite of LLMs tailored for the Earth science, biology, physics, heliophysics, planetary sciences and astrophysics domains and trained using curated scientific corpora drawn from diverse data sources. The suite includes: (1) an encoder model trained using domain-specific vocabulary and corpora to address natural language understanding tasks, (2) a contrastive-learning-based general text embedding model trained using a diverse set of datasets drawn from multiple sources to address information retrieval tasks, and (3) smaller versions of these models created using knowledge distillation techniques to address applications with latency or resource constraints. We also created three new scientific benchmark datasets, namely CLIMATE-CHANGE-NER (entity recognition), NASA-QA (extractive QA) and NASA-IR (IR), to accelerate research in these multi-disciplinary fields. Finally, we show that our models outperform both general-purpose encoders (RoBERTa) and existing domain-specific encoders (SciBERT) on these new tasks as well as existing benchmark tasks in the domains of interest.
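
For readers unfamiliar with the contrastive learning mentioned in (2), below is a minimal sketch of an InfoNCE-style loss with in-batch negatives, the objective family commonly used to train general text embedding models for retrieval. The function name, the temperature value, and the use of in-batch negatives are illustrative assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor,
                  passage_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style contrastive loss with in-batch negatives (illustrative sketch).

    query_emb, passage_emb: (batch, dim) embeddings of paired
    (query, relevant passage) examples; every other passage in the
    batch serves as a negative for a given query.
    """
    # Cosine similarity between every query and every passage in the batch.
    query_emb = F.normalize(query_emb, dim=-1)
    passage_emb = F.normalize(passage_emb, dim=-1)
    logits = query_emb @ passage_emb.T / temperature  # (batch, batch)

    # The matching passage for query i sits on the diagonal, so the
    # "class label" of query i is simply index i.
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```

Each query is scored against every passage in the batch, and the paired passage on the diagonal is treated as the positive; minimizing the loss pulls matched pairs together and pushes unmatched pairs apart in embedding space.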
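
Likewise, a minimal sketch of the knowledge distillation mentioned in (3): a smaller student model is trained to match a larger teacher's softened output distribution while still fitting the ground-truth labels. The temperature, the loss weighting, and the classification framing are illustrative assumptions; the paper's actual distillation recipe may differ.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend a soft KL term against the teacher with the usual hard-label loss."""
    # Soften both distributions with the temperature, then match the
    # student's log-probabilities to the teacher's probabilities.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2

    # Standard cross-entropy against the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```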
