
SampoNLP: A Self-Referential Toolkit for Morphological Analysis of Subword Tokenizers

January 8, 2026
Authors: Iaroslav Chelombitko, Ekaterina Chelombitko, Aleksey Komissarov
cs.AI

Abstract

The quality of subword tokenization is critical for Large Language Models, yet evaluating tokenizers for morphologically rich Uralic languages is hampered by the lack of clean morpheme lexicons. We introduce SampoNLP, a corpus-free toolkit for morphological lexicon creation based on MDL-inspired Self-Referential Atomicity Scoring, which filters composite forms using internal structural cues and is therefore suited to low-resource settings. Using the high-purity lexicons generated by SampoNLP for Finnish, Hungarian, and Estonian, we conduct a systematic evaluation of BPE tokenizers across a range of vocabulary sizes (8k-256k). We propose a unified metric, the Integrated Performance Score (IPS), to navigate the trade-off between morpheme coverage and over-splitting. By analyzing the IPS curves, we identify the "elbow points" of diminishing returns and provide the first empirically grounded recommendations for optimal vocabulary sizes (k) in these languages. Our study not only offers practical guidance but also quantitatively demonstrates the limitations of standard BPE for highly agglutinative languages. The SampoNLP library and all generated resources are publicly available: https://github.com/AragonerUA/SampoNLP
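
The abstract does not give the IPS formula, so the following Python sketch does not attempt to reproduce it. It only illustrates the two ideas the abstract names: measuring how many lexicon morphemes a tokenizer vocabulary covers, and locating an "elbow" of diminishing returns on a score-versus-vocabulary-size curve. The function names, the exact-match definition of coverage, and the farthest-from-chord elbow heuristic are all assumptions made for illustration; they are not the SampoNLP API or the authors' method.

```python
import math


def morpheme_coverage(vocab, lexicon):
    """Fraction of lexicon morphemes present as whole tokens in vocab.

    Assumption: coverage means exact string match; the paper may define it differently.
    """
    lexicon = set(lexicon)
    if not lexicon:
        return 0.0
    return len(lexicon & set(vocab)) / len(lexicon)


def elbow_index(sizes, scores):
    """Index of the point farthest from the chord joining the first and last points.

    Sizes are placed on a log2 axis and both axes are rescaled to [0, 1],
    so the result does not depend on the units of either axis.
    This is a generic heuristic, not necessarily the one used in the paper.
    """
    if len(sizes) != len(scores) or len(sizes) < 3:
        raise ValueError("need at least three (size, score) pairs")
    xs = [math.log2(s) for s in sizes]
    x_span = xs[-1] - xs[0]
    y_min, y_max = min(scores), max(scores)
    if x_span == 0 or y_max == y_min:
        raise ValueError("curve is degenerate")
    pts = [((x - xs[0]) / x_span, (y - y_min) / (y_max - y_min))
           for x, y in zip(xs, scores)]
    (ax, ay), (bx, by) = pts[0], pts[-1]
    chord_len = math.hypot(bx - ax, by - ay)

    def distance(p):
        # Perpendicular distance from p to the line through a and b.
        px, py = p
        return abs((bx - ax) * (ay - py) - (ax - px) * (by - ay)) / chord_len

    return max(range(len(pts)), key=lambda i: distance(pts[i]))


if __name__ == "__main__":
    # Placeholder numbers for illustration only; they are NOT results from the paper.
    sizes = [8_000, 16_000, 32_000, 64_000, 128_000, 256_000]
    scores = [0.10, 0.40, 0.55, 0.60, 0.62, 0.63]
    print("elbow at vocabulary size:", sizes[elbow_index(sizes, scores)])
```

In practice, `vocab` would be the token list of a trained BPE tokenizer and `lexicon` one of the generated morpheme lexicons; the score passed to `elbow_index` would be whatever per-size metric is under study.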