PhysBERT: A Text Embedding Model for Physics Scientific Literature
August 18, 2024
Authors: Thorsten Hellert, João Montenegro, Andrea Pollastro
cs.AI
Abstract
The specialized language and complex concepts in physics pose significant
challenges for information extraction through Natural Language Processing
(NLP). Central to effective NLP applications is the text embedding model, which
converts text into dense vector representations for efficient information
retrieval and semantic analysis. In this work, we introduce PhysBERT, the first
physics-specific text embedding model. Pre-trained on a curated corpus of 1.2
million arXiv physics papers and fine-tuned with supervised data, PhysBERT
outperforms leading general-purpose models on physics-specific tasks, including
the effectiveness in fine-tuning for specific physics subdomains.
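The abstract describes how an embedding model maps text to dense vectors so that semantically similar documents can be retrieved by vector comparison. A minimal sketch of that retrieval step is below; the vectors are toy placeholders standing in for actual PhysBERT outputs, and the document names are invented for illustration:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two dense embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional vectors; a real embedding model would produce
# high-dimensional vectors (e.g. 768-d for BERT-sized encoders).
query = np.array([0.9, 0.1, 0.0])
corpus = {
    "paper_A": np.array([0.8, 0.2, 0.1]),  # close to the query in vector space
    "paper_B": np.array([0.1, 0.9, 0.3]),  # semantically distant
}

# Rank documents by similarity to the query embedding.
ranked = sorted(corpus, key=lambda k: cosine_similarity(query, corpus[k]),
                reverse=True)
print(ranked[0])  # → paper_A
```

In a real pipeline, the same ranking logic applies; only the vectors come from encoding the query and documents with the trained embedding model.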