Morpheus: トルコ語のための形態素認識型ニューラルトークナイザーおよび単語埋め込み器

要旨

トルコ語は膠着的である。すなわち、意味は形態素によって担われる。しかし、現代の言語モデルを駆動するサブワードトークナイザはコーパス統計に基づいて単語を分割し、意味を担う接辞を断片化し、WordPieceやルールベースの解析器の場合には、その出力を元のテキストにデコードすることに失敗する。本論文では、トルコ語向けのニューラル形態素境界モデルMorpheusを提案する。これは、ロスレスで形態論を考慮したトークナイザであると同時に、単語埋め込み生成器でもある。微分可能なポアソン二項動的計画法により、トレーニング中は文字ごとの境界確率をソフトな形態素所属度に変換し、推論時には厳密なセグメントを得る。文字列の正規化は行わないため、decode(encode(w)) = w が設計上保証される。モデルがニューラルであるため、トークン化と同じ順伝播で構造化された単語埋め込みも出力される。可逆トークナイザ（生成に有効な唯一のトークナイザ）の中で、Morpheusは最も低い文字あたりビット数（1.425）を達成し、サブワードファミリーの正解形態素アライメントを約2倍に向上させ（MorphScore マクロF1 0.61 vs. 約0.32）、64K語彙のサブワードトークナイザと比較してGPUメモリを約19%削減する。埋め込みモデルとして、凍結されたMorpheusのベクトルは語彙検索（語根ファミリーのMAP 0.85）と同語根検証（ROC-AUC 1.00）でリードし、多言語検索モデルBGE-M3やBERTurkを上回る。一方、文脈や屈折に依存するタスク（NER、格/数プロービング）では、より重い文脈エンコーダが依然として優れている。このトレードオフは、Morpheusの語根中心の幾何構造に起因すると考えられる。コード: https://github.com/lonewolf-rd/TurkishMorpheus; モデル: https://huggingface.co/lonewolflab/Morpheus-TR-50K; インタラクティブデモ: https://huggingface.co/spaces/lonewolflab/morpheus-tr-demo。

English

Turkish is agglutinative: meaning is carried by morphemes, yet the subword tokenizers that drive modern language models split words by corpus statistics, fragmenting semantically loaded suffixes and -- in the case of WordPiece and rule-based analyzers -- failing to decode their output back to the original text. This paper presents Morpheus, a neural morpheme-boundary model for Turkish that is at once a lossless, morphology-aware tokenizer and a word-embedding producer. A differentiable Poisson-binomial dynamic program turns per-character boundary probabilities into soft morpheme memberships during training and exact segments at inference, with no string normalization, so decode(encode(w)) = w holds by construction. Because the model is neural, the same forward pass that tokenizes also emits a structured word embedding. Among reversible tokenizers -- the only ones valid for generation -- Morpheus attains the lowest bits-per-character (1.425), roughly doubles the gold morphological alignment of the subword family (MorphScore macro-F1 0.61 vs.\ {sim}0.32), and uses {sim}19% less GPU memory than 64K-vocabulary subword tokenizers. As an embedder, frozen Morpheus vectors lead on lexical retrieval (root-family MAP 0.85) and same-root verification (ROC-AUC 1.00), surpassing the multilingual retriever BGE-M3 and BERTurk; on context- and inflection-dependent tasks (NER, case/number probing) the heavier contextual encoders remain ahead -- a trade-off we attribute to Morpheus's root-centric geometry. Code: https://github.com/lonewolf-rd/TurkishMorpheus; model: https://huggingface.co/lonewolflab/Morpheus-TR-50K; interactive demo: https://huggingface.co/spaces/lonewolflab/morpheus-tr-demo.