SampoNLP: A Self-Referential Toolkit for Morphological Analysis of Subword Tokenizers

Abstract

The quality of subword tokenization is critical for Large Language Models, yet evaluating tokenizers for morphologically rich Uralic languages is hampered by the lack of clean morpheme lexicons. We introduce SampoNLP, a corpus-free toolkit for creating morphological lexicons. It uses MDL-inspired Self-Referential Atomicity Scoring, which filters out composite forms using internal structural cues and is therefore suited to low-resource settings. Using the high-purity lexicons that SampoNLP generates for Finnish, Hungarian, and Estonian, we conduct a systematic evaluation of BPE tokenizers across vocabulary sizes from 8k to 256k. We propose a unified metric, the Integrated Performance Score (IPS), to navigate the trade-off between morpheme coverage and over-splitting. By analyzing the IPS curves, we identify the "elbow points" of diminishing returns and provide the first empirically grounded recommendations for optimal vocabulary sizes (k) in these languages. Our study not only offers practical guidance but also quantitatively demonstrates the limitations of standard BPE for highly agglutinative languages. The SampoNLP library and all generated resources are publicly available: https://github.com/AragonerUA/SampoNLP

