Morpheus: Ein morphologiebewusster neuronaler Tokenisierer und Wort-Embedder für Türkisch

Zusammenfassung

Türkisch ist agglutinierend: Bedeutung wird durch Morpheme transportiert, doch die Subwort-Tokenisierer, die moderne Sprachmodelle antreiben, zerlegen Wörter nach Korpusstatistiken, zerschlagen semantisch aufgeladene Suffixe und – im Fall von WordPiece und regelbasierten Analysatoren – scheitern daran, ihre Ausgabe zurück zum Originaltext zu dekodieren. Dieses Paper stellt Morpheus vor, ein neuronales Morphemgrenzen-Modell für Türkisch, das zugleich ein verlustfreier, morphologiebewusster Tokenisierer und ein Produzent von Worteinbettungen ist. Ein differenzierbares Poisson-Binomial-Dynamisches Programm wandelt zeichenweise Grenzwahrscheinlichkeiten während des Trainings in weiche Morphemzugehörigkeiten und bei der Inferenz in exakte Segmente um, ohne String-Normalisierung, sodass decode(encode(w)) = w konstruktionsbedingt gilt. Da das Modell neuronal ist, erzeugt derselbe Vorwärtsdurchlauf, der tokenisiert, auch eine strukturierte Worteinbettung. Unter reversiblen Tokenisierern – den einzigen, die für die Generierung valide sind – erreicht Morpheus die niedrigsten Bits pro Zeichen (1,425), verdoppelt in etwa die goldene morphologische Ausrichtung der Subwort-Familie (MorphScore macro-F1 0,61 vs. ~0,32) und benötigt ~19 % weniger GPU-Speicher als Subwort-Tokenisierer mit 64K-Vokabular. Als Embedder führen eingefrorene Morpheus-Vektoren bei lexikalischem Retrieval (Root-Family MAP 0,85) und Gleichstamm-Verifikation (ROC-AUC 1,00) und übertreffen den multilingualen Retriever BGE-M3 sowie BERTurk; bei kontext- und flexionsabhängigen Aufgaben (NER, Kasus/Numerus-Probing) bleiben die schwereren kontextuellen Encoder vorn – ein Kompromiss, den wir auf Morpheus' wurzelzentrierte Geometrie zurückführen. Code: https://github.com/lonewolf-rd/TurkishMorpheus; Modell: https://huggingface.co/lonewolflab/Morpheus-TR-50K; interaktive Demo: https://huggingface.co/spaces/lonewolflab/morpheus-tr-demo.

English

Turkish is agglutinative: meaning is carried by morphemes, yet the subword tokenizers that drive modern language models split words by corpus statistics, fragmenting semantically loaded suffixes and -- in the case of WordPiece and rule-based analyzers -- failing to decode their output back to the original text. This paper presents Morpheus, a neural morpheme-boundary model for Turkish that is at once a lossless, morphology-aware tokenizer and a word-embedding producer. A differentiable Poisson-binomial dynamic program turns per-character boundary probabilities into soft morpheme memberships during training and exact segments at inference, with no string normalization, so decode(encode(w)) = w holds by construction. Because the model is neural, the same forward pass that tokenizes also emits a structured word embedding. Among reversible tokenizers -- the only ones valid for generation -- Morpheus attains the lowest bits-per-character (1.425), roughly doubles the gold morphological alignment of the subword family (MorphScore macro-F1 0.61 vs.\ {sim}0.32), and uses {sim}19% less GPU memory than 64K-vocabulary subword tokenizers. As an embedder, frozen Morpheus vectors lead on lexical retrieval (root-family MAP 0.85) and same-root verification (ROC-AUC 1.00), surpassing the multilingual retriever BGE-M3 and BERTurk; on context- and inflection-dependent tasks (NER, case/number probing) the heavier contextual encoders remain ahead -- a trade-off we attribute to Morpheus's root-centric geometry. Code: https://github.com/lonewolf-rd/TurkishMorpheus; model: https://huggingface.co/lonewolflab/Morpheus-TR-50K; interactive demo: https://huggingface.co/spaces/lonewolflab/morpheus-tr-demo.