

Incorporating Domain Knowledge into Materials Tokenization

June 9, 2025
Authors: Yerim Oh, Jun-Hyung Park, Junho Kim, SungHo Kim, SangKeun Lee
cs.AI

Abstract

While language models are increasingly utilized in materials science, typical models rely on frequency-centric tokenization methods originally developed for natural language processing. However, these methods frequently produce excessive fragmentation and semantic loss, failing to maintain the structural and semantic integrity of material concepts. To address this issue, we propose MATTER, a novel tokenization approach that integrates material knowledge into tokenization. Based on MatDetector trained on our materials knowledge base and a re-ranking method prioritizing material concepts in token merging, MATTER maintains the structural integrity of identified material concepts and prevents fragmentation during tokenization, ensuring their semantic meaning remains intact. The experimental results demonstrate that MATTER outperforms existing tokenization methods, achieving an average performance gain of 4% and 2% in the generation and classification tasks, respectively. These results underscore the importance of domain knowledge for tokenization strategies in scientific text processing. Our code is available at https://github.com/yerimoh/MATTER.
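To make the re-ranking idea concrete, below is a minimal sketch of how frequency-based merge selection could be biased toward detected material concepts. This is not the authors' implementation: `MATERIAL_TERMS`, `material_bonus`, and `best_merge` are illustrative stand-ins for MatDetector and the paper's actual scoring, and the bonus weighting is an arbitrary assumption.

```python
# Minimal sketch (not the MATTER implementation) of re-ranking BPE-style merge
# candidates so that merges building known material concepts are prioritized.
# MATERIAL_TERMS and the bonus value are illustrative assumptions standing in
# for MatDetector output and the paper's actual scoring function.
from collections import Counter

MATERIAL_TERMS = {"LiFePO4", "perovskite", "graphene"}  # stand-in for detected concepts

def material_bonus(merged_token: str) -> float:
    """Bonus if the merged token equals, or is a prefix of, a material concept."""
    for term in MATERIAL_TERMS:
        if term == merged_token or term.startswith(merged_token):
            return 1.0
    return 0.0

def best_merge(corpus_tokens: list[list[str]]) -> tuple[str, str] | None:
    """Pick the next pair to merge: normalized frequency re-ranked by the bonus."""
    pair_counts: Counter[tuple[str, str]] = Counter()
    for word in corpus_tokens:
        for a, b in zip(word, word[1:]):
            pair_counts[(a, b)] += 1
    if not pair_counts:
        return None
    total = sum(pair_counts.values())
    # Re-rank: generic frequency plus a bonus for pairs that extend material concepts,
    # so fragments of material terms are merged before equally frequent generic pairs.
    return max(
        pair_counts,
        key=lambda p: pair_counts[p] / total + material_bonus(p[0] + p[1]),
    )

# Toy usage: the characters of "LiFePO4" are merged ahead of other pairs.
words = [list("LiFePO4_cathode")]
print(best_merge(words))  # ('L', 'i')
```

In this toy setup, the bonus keeps fragments of a material term winning each merge step, so the concept ends up as a single token rather than being split, which is the intuition the abstract describes for avoiding fragmentation.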