Morpheus: 터키어를 위한 형태소 인식 신경망 토크나이저 및 단어 임베더

초록

터키어는 교착어로, 의미가 형태소에 의해 전달되지만 현대 언어 모델을 구동하는 하위 단어 토크나이저는 말뭉치 통계에 따라 단어를 분할하여 의미적 부담을 가진 접사를 조각내고, WordPiece 및 규칙 기반 분석기의 경우 출력을 원문으로 다시 디코딩하는 데 실패한다. 본 논문은 터키어를 위한 신경 형태소 경계 모델인 Morpheus를 제시하는데, 이는 무손실이며 형태소를 인식하는 토크나이저이자 단어 임베딩 생성기 역할을 동시에 수행한다. 미분 가능한 푸아송-이항 동적 프로그래밍은 훈련 중 문자별 경계 확률을 소프트 형태소 소속으로 변환하고 추론 시에는 정확한 세그먼트를 생성하며, 문자열 정규화가 없으므로 설계상 디코드(인코드(w)) = w가 성립한다. 신경 모델이기 때문에 토큰화를 수행하는 동일한 순전파가 구조화된 단어 임베딩도 함께 출력한다. 가역 토크나이저 중에서 – 생성에 유효한 유일한 토크나이저 – Morpheus는 가장 낮은 문자당 비트 수(1.425)를 달성하고, 하위 단어 계열의 형태소 정렬 정답을 대략 두 배로 높이며(MorphScore 매크로 F1 0.61 대 ~0.32), 64K 어휘 하위 단어 토크나이저보다 약 19% 적은 GPU 메모리를 사용한다. 임베더로서, 고정된 Morpheus 벡터는 어휘 검색(어근 집합 MAP 0.85) 및 동일 어근 검증(ROC-AUC 1.00)에서 선두를 차지하여 다국어 검색기 BGE-M3와 BERTurk를 능가한다. 문맥 및 굴절 의존 작업(NER, 격/수 탐침)에서는 더 무거운 문맥 인코더가 여전히 앞서 있으며, 이는 Morpheus의 어근 중심 기하 구조에 기인한 트레이드오프이다. 코드: https://github.com/lonewolf-rd/TurkishMorpheus; 모델: https://huggingface.co/lonewolflab/Morpheus-TR-50K; 대화형 데모: https://huggingface.co/spaces/lonewolflab/morpheus-tr-demo.

English

Turkish is agglutinative: meaning is carried by morphemes, yet the subword tokenizers that drive modern language models split words by corpus statistics, fragmenting semantically loaded suffixes and -- in the case of WordPiece and rule-based analyzers -- failing to decode their output back to the original text. This paper presents Morpheus, a neural morpheme-boundary model for Turkish that is at once a lossless, morphology-aware tokenizer and a word-embedding producer. A differentiable Poisson-binomial dynamic program turns per-character boundary probabilities into soft morpheme memberships during training and exact segments at inference, with no string normalization, so decode(encode(w)) = w holds by construction. Because the model is neural, the same forward pass that tokenizes also emits a structured word embedding. Among reversible tokenizers -- the only ones valid for generation -- Morpheus attains the lowest bits-per-character (1.425), roughly doubles the gold morphological alignment of the subword family (MorphScore macro-F1 0.61 vs.\ {sim}0.32), and uses {sim}19% less GPU memory than 64K-vocabulary subword tokenizers. As an embedder, frozen Morpheus vectors lead on lexical retrieval (root-family MAP 0.85) and same-root verification (ROC-AUC 1.00), surpassing the multilingual retriever BGE-M3 and BERTurk; on context- and inflection-dependent tasks (NER, case/number probing) the heavier contextual encoders remain ahead -- a trade-off we attribute to Morpheus's root-centric geometry. Code: https://github.com/lonewolf-rd/TurkishMorpheus; model: https://huggingface.co/lonewolflab/Morpheus-TR-50K; interactive demo: https://huggingface.co/spaces/lonewolflab/morpheus-tr-demo.