PhysBERT: A Text Embedding Model for Physics Scientific Literature

Abstract

The specialized language and complex concepts of physics pose significant challenges for information extraction through Natural Language Processing (NLP). Central to effective NLP applications is the text embedding model, which converts text into dense vector representations for efficient information retrieval and semantic analysis. In this work, we introduce PhysBERT, the first physics-specific text embedding model. Pre-trained on a curated corpus of 1.2 million arXiv physics papers and fine-tuned with supervised data, PhysBERT outperforms leading general-purpose models on physics-specific tasks, and it remains effective when fine-tuned for specific physics subdomains.
