Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish
June 17, 2026
Author: Tolga Şakar
cs.AI
Abstract
Turkish is agglutinative: meaning is carried by morphemes, yet the subword tokenizers that drive modern language models split words by corpus statistics, fragmenting semantically loaded suffixes and, in the case of WordPiece and rule-based analyzers, failing to decode their output back to the original text. This paper presents Morpheus, a neural morpheme-boundary model for Turkish that is at once a lossless, morphology-aware tokenizer and a word-embedding producer. A differentiable Poisson-binomial dynamic program turns per-character boundary probabilities into soft morpheme memberships during training and exact segments at inference, with no string normalization, so decode(encode(w)) = w holds by construction. Because the model is neural, the same forward pass that tokenizes also emits a structured word embedding. Among reversible tokenizers (the only ones valid for generation), Morpheus attains the lowest bits-per-character (1.425), roughly doubles the subword family's alignment with gold morphological segmentation (MorphScore macro-F1 0.61 vs. ~0.32), and uses ~19% less GPU memory than 64K-vocabulary subword tokenizers. As an embedder, frozen Morpheus vectors lead on lexical retrieval (root-family MAP 0.85) and same-root verification (ROC-AUC 1.00), surpassing the multilingual retriever BGE-M3 and BERTurk; on context- and inflection-dependent tasks (NER, case/number probing) the heavier contextual encoders remain ahead, a trade-off we attribute to Morpheus's root-centric geometry. Code: https://github.com/lonewolf-rd/TurkishMorpheus; model: https://huggingface.co/lonewolflab/Morpheus-TR-50K; interactive demo: https://huggingface.co/spaces/lonewolflab/morpheus-tr-demo.
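The abstract does not spell out the dynamic program, so the following is only a minimal sketch of one plausible reading, not the released implementation. It assumes the boundary decisions between adjacent characters are independent Bernoulli variables, so the segment index of a character (the number of boundaries before it) follows a Poisson-binomial distribution that can be computed by a simple recursion. The function names `soft_memberships` and `hard_segments`, the `max_segments` cap, the 0.5 threshold, and the example probabilities are all hypothetical. The sketch is written in plain Python for readability; a trainable version would express the same recursion in an autodiff framework.

```python
from typing import List


def soft_memberships(boundary_probs: List[float], max_segments: int) -> List[List[float]]:
    """Soft segment membership per character (assumed formulation).

    boundary_probs[i] is the probability of a boundary between character i
    and character i + 1, so a word of n characters has n - 1 entries.
    Returns rows[t][k] = P(character t belongs to segment k).
    """
    # Character 0 is always in segment 0.
    dist = [1.0] + [0.0] * (max_segments - 1)
    rows = [dist[:]]
    for p in boundary_probs:
        new = [0.0] * max_segments
        for k in range(max_segments):
            new[k] += dist[k] * (1.0 - p)      # no boundary: stay in segment k
            if k + 1 < max_segments:
                new[k + 1] += dist[k] * p      # boundary: move to segment k + 1
            else:
                new[k] += dist[k] * p          # cap reached: keep mass in last segment
        dist = new
        rows.append(dist[:])
    return rows


def hard_segments(word: str, boundary_probs: List[float], threshold: float = 0.5) -> List[str]:
    """Cut the raw string wherever the boundary probability exceeds the threshold.

    No normalization is applied, so joining the segments reproduces the word.
    """
    if len(boundary_probs) != max(len(word) - 1, 0):
        raise ValueError("expected one boundary probability per adjacent character pair")
    segments, start = [], 0
    for i, p in enumerate(boundary_probs):
        if p > threshold:
            segments.append(word[start:i + 1])
            start = i + 1
    segments.append(word[start:])
    return segments


if __name__ == "__main__":
    word = "evlerde"  # ev + ler + de, "in the houses"
    probs = [0.02, 0.97, 0.03, 0.05, 0.95, 0.04]  # made-up illustrative values
    segments = hard_segments(word, probs)
    print(segments)                   # ['ev', 'ler', 'de']
    print("".join(segments) == word)  # True
    for ch, row in zip(word, soft_memberships(probs, max_segments=3)):
        print(ch, [round(x, 3) for x in row])
```

Because the hard segmentation only slices the original string, the round-trip property decode(encode(w)) = w holds trivially in this sketch; how the actual model maps segments to vocabulary ids and back is not described in the abstract.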