Incorporating Domain Knowledge into Materials Tokenization

Abstract

While language models are increasingly used in materials science, typical models still rely on frequency-centric tokenization methods originally developed for natural language processing. These methods frequently produce excessive fragmentation and semantic loss, failing to maintain the structural and semantic integrity of material concepts. To address this issue, we propose MATTER, a novel tokenization approach that integrates material knowledge into the tokenization process. MATTER builds on MatDetector, which is trained on our materials knowledge base, and on a re-ranking method that prioritizes material concepts during token merging. Together, these keep identified material concepts structurally intact during tokenization, preventing fragmentation and preserving their semantic meaning. Experimental results show that MATTER outperforms existing tokenization methods, achieving average performance gains of 4% and 2% on generation and classification tasks, respectively. These results underscore the importance of domain knowledge for tokenization strategies in scientific text processing. Our code is available at https://github.com/yerimoh/MATTER.
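To make the idea of re-ranking token merges concrete, the following is a minimal toy sketch and not the MATTER implementation. It assumes a BPE-style trainer in which each candidate merge is scored by its pair frequency, scaled by a material-concept bonus. The names `train_merges`, `toy_material_score`, `MATERIAL_TERMS`, and the weight `lam` are all hypothetical; in particular, `toy_material_score` is a simple substring lookup that merely stands in for a learned detector such as MatDetector.

```python
from collections import Counter

# Hypothetical stand-in for a learned material-concept detector:
# a candidate scores 1.0 if it occurs inside a known material term.
MATERIAL_TERMS = {"TiO2"}

def toy_material_score(candidate):
    return 1.0 if any(candidate in term for term in MATERIAL_TERMS) else 0.0

def pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_merges(corpus, num_merges, material_score=None, lam=0.0):
    """Learn merges; rank = frequency * (1 + lam * material bonus)."""
    words = {tuple(w): f for w, f in corpus.items()}
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(words)
        if not counts:
            break

        def rank(pair):
            bonus = material_score(pair[0] + pair[1]) if material_score else 0.0
            return counts[pair] * (1.0 + lam * bonus)

        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(counts), key=rank)
        merges.append(best)
        words = merge_pair(words, best)
    return merges

def tokenize(word, merges):
    """Apply learned merges, in order, to a single word."""
    words = {tuple(word): 1}
    for pair in merges:
        words = merge_pair(words, pair)
    return list(next(iter(words)))

# Toy corpus: common English words dominate the raw frequencies.
corpus = {"the": 10, "then": 6, "TiO2": 3}

plain = train_merges(corpus, 3)
reranked = train_merges(corpus, 3, toy_material_score, lam=10.0)

print(tokenize("TiO2", plain))     # ['T', 'i', 'O', '2']
print(tokenize("TiO2", reranked))  # ['TiO2']
```

With the same merge budget, the purely frequency-based run spends all three merges on "the" and "then" and leaves the formula split into characters, whereas the re-ranked run keeps it as a single token. The multiplicative bonus is only one possible way to combine frequency with a concept score and is chosen here for simplicity.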
