PhysBERT: A Text Embedding Model for Physics Scientific Literature
August 18, 2024
Authors: Thorsten Hellert, João Montenegro, Andrea Pollastro
cs.AI
Abstract
The specialized language and complex concepts in physics pose significant
challenges for information extraction through Natural Language Processing
(NLP). Central to effective NLP applications is the text embedding model, which
converts text into dense vector representations for efficient information
retrieval and semantic analysis. In this work, we introduce PhysBERT, the first
physics-specific text embedding model. Pre-trained on a curated corpus of 1.2
million arXiv physics papers and fine-tuned with supervised data, PhysBERT
outperforms leading general-purpose models on physics-specific tasks including
the effectiveness in fine-tuning for specific physics subdomains.
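As a minimal sketch of the retrieval workflow the abstract describes: an embedding model such as PhysBERT maps each document to a dense vector, and semantically similar texts are found by nearest-neighbor search over those vectors, typically with cosine similarity. The vectors and document titles below are hypothetical hard-coded stand-ins purely to illustrate the search step; in practice they would be produced by the embedding model.

```python
import numpy as np

# Hypothetical corpus: document title -> dense embedding vector.
# Real embeddings (e.g. from PhysBERT) would be hundreds of dimensions;
# 4-d toy vectors suffice to show the ranking mechanics.
corpus = {
    "beam dynamics in storage rings":  np.array([0.9, 0.1, 0.0, 0.2]),
    "topological insulators":          np.array([0.1, 0.8, 0.3, 0.0]),
    "cavity quantum electrodynamics":  np.array([0.2, 0.3, 0.9, 0.1]),
}

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query_vec, corpus):
    """Return the corpus document whose embedding is closest to the query."""
    return max(corpus, key=lambda doc: cosine(query_vec, corpus[doc]))

# Hypothetical query embedding, e.g. for "synchrotron optics".
query = np.array([0.85, 0.15, 0.05, 0.25])
print(search(query, corpus))  # -> beam dynamics in storage rings
```

The same ranking generalizes to the paper's evaluation setting: a domain-tuned model places physics texts about the same topic closer together in the vector space, so the nearest-neighbor step surfaces more relevant documents than a general-purpose embedding would.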