Incorporating Domain Knowledge into Materials Tokenization
June 9, 2025
Authors: Yerim Oh, Jun-Hyung Park, Junho Kim, SungHo Kim, SangKeun Lee
cs.AI
Abstract
While language models are increasingly utilized in materials science, typical
models rely on frequency-centric tokenization methods originally developed for
natural language processing. However, these methods frequently produce
excessive fragmentation and semantic loss, failing to maintain the structural
and semantic integrity of material concepts. To address this issue, we propose
MATTER, a novel approach that integrates materials knowledge into
tokenization. MATTER combines MatDetector, which is trained on our materials
knowledge base, with a re-ranking method that prioritizes material concepts
during token merging. It thereby keeps identified material concepts
structurally intact and prevents their fragmentation during tokenization,
preserving their semantic meaning. Experiments show that MATTER outperforms
existing tokenization methods, with average performance gains of 4% and 2% on
generation and classification tasks, respectively. These results
underscore the importance of domain knowledge for tokenization strategies in
scientific text processing. Our code is available at
https://github.com/yerimoh/MATTER
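
The idea of re-ranking token merges so that material concepts take priority can be illustrated with a toy byte-pair-encoding trainer. The sketch below is not the released MATTER code, and every name in it is hypothetical: `MATERIAL_TERMS` is a fixed lexicon standing in for a trained detector such as MatDetector, and the score `freq + boost * material_freq` is an assumed re-ranking rule chosen for illustration, not the formula used in the paper.

```python
from collections import Counter

# Hypothetical stand-in for a material-concept detector: a fixed lexicon.
# A real detector would be a trained model; this lookup is only an assumption.
MATERIAL_TERMS = {"germanium"}


def is_material(word):
    return word in MATERIAL_TERMS


def pair_stats(corpus):
    """Count adjacent symbol pairs, overall and inside material words."""
    freq, mat_freq = Counter(), Counter()
    for symbols, count, material in corpus:
        for pair in zip(symbols, symbols[1:]):
            freq[pair] += count
            if material:
                mat_freq[pair] += count
    return freq, mat_freq


def merge_pair(symbols, pair):
    """Join each left-to-right occurrence of `pair` into a single symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_merges(word_counts, num_merges, boost=0.0):
    """Toy BPE training.

    boost == 0 ranks candidate merges by raw frequency (plain BPE).
    boost > 0 re-ranks them by freq + boost * (frequency inside material words),
    which is an assumed scoring rule for this sketch.
    """
    corpus = [(tuple(w), c, is_material(w)) for w, c in word_counts.items()]
    merges = []
    for _ in range(num_merges):
        freq, mat_freq = pair_stats(corpus)
        if not freq:
            break
        # The pair itself is part of the key so that ties break deterministically.
        best = max(freq, key=lambda p: (freq[p] + boost * mat_freq[p], p))
        merges.append(best)
        corpus = [(merge_pair(s, best), c, m) for s, c, m in corpus]
    return merges


def tokenize(word, merges):
    """Apply the learned merges to a word in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus: frequent general-language words and one rare material term.
WORD_COUNTS = {"the": 50, "then": 30, "there": 20, "germanium": 3}

plain = train_merges(WORD_COUNTS, num_merges=8, boost=0.0)
boosted = train_merges(WORD_COUNTS, num_merges=8, boost=100.0)

print(tokenize("germanium", plain))
print(tokenize("germanium", boosted))
```

On this toy corpus, with the same budget of eight merges, frequency-only ranking spends most merges on the common words and leaves "germanium" split into several pieces, while the boosted ranking keeps it as a single token. The large `boost` value also shows the trade-off: merges for general vocabulary are postponed, so in practice the weighting would need tuning.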