PhysBERT: A Text Embedding Model for Physics Scientific Literature
August 18, 2024
Authors: Thorsten Hellert, João Montenegro, Andrea Pollastro
cs.AI
Abstract
The specialized language and complex concepts in physics pose significant
challenges for information extraction through Natural Language Processing
(NLP). Central to effective NLP applications is the text embedding model, which
converts text into dense vector representations for efficient information
retrieval and semantic analysis. In this work, we introduce PhysBERT, the first
physics-specific text embedding model. Pre-trained on a curated corpus of 1.2
million arXiv physics papers and fine-tuned with supervised data, PhysBERT
outperforms leading general-purpose models on physics-specific tasks, including
the effectiveness in fine-tuning for specific physics subdomains.
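The abstract describes how an embedding model maps text to dense vectors so that semantically similar documents can be retrieved by vector comparison. A minimal sketch of that retrieval step is below; the vectors are toy placeholders standing in for actual PhysBERT outputs, and the document names are invented for illustration:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two dense embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional vectors; a real embedding model would produce
# high-dimensional vectors (e.g. 768-d for BERT-sized encoders).
query = np.array([0.9, 0.1, 0.0])
corpus = {
    "paper_A": np.array([0.8, 0.2, 0.1]),  # close to the query in vector space
    "paper_B": np.array([0.1, 0.9, 0.3]),  # semantically distant
}

# Rank documents by similarity to the query embedding.
ranked = sorted(corpus, key=lambda k: cosine_similarity(query, corpus[k]),
                reverse=True)
print(ranked[0])  # → paper_A
```

In a real pipeline, the same ranking logic applies; only the vectors come from encoding the query and documents with the trained embedding model.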