Optimal Turkish Subword Strategies at Scale: Systematic Evaluation of Data, Vocabulary, Morphology Interplay
February 6, 2026
Author: Duygu Altinok
cs.AI
Abstract
Tokenization is a pivotal design choice for neural language modeling in morphologically rich languages (MRLs) such as Turkish, where productive agglutination challenges both vocabulary efficiency and morphological fidelity. Prior studies have explored tokenizer families and vocabulary sizes but typically (i) vary vocabulary without systematically controlling the tokenizer's training corpus, (ii) provide limited intrinsic diagnostics, and (iii) evaluate a narrow slice of downstream tasks. We present the first comprehensive, principled study of Turkish subword tokenization, a "subwords manifest" that jointly varies vocabulary size and tokenizer training corpus size (data and vocabulary coupling), compares multiple tokenizer families under matched parameter budgets (WordPiece, morphology-level, and character baselines), and evaluates across semantic (NLI, STS, sentiment analysis, NER), syntactic (POS tagging, dependency parsing), and morphology-sensitive probes. To explain why tokenizers succeed or fail, we introduce a morphology-aware diagnostic toolkit that goes beyond coarse aggregates to boundary-level micro/macro F1, decoupled lemma atomicity vs. surface boundary hits, over-/under-segmentation indices, character/word edit distances (CER/WER), continuation rates, affix-type coverage, and token-level atomicity. Our contributions are fourfold: (i) a systematic investigation of the vocabulary-corpus-success triad; (ii) a unified, morphology-aware evaluation framework linking intrinsic diagnostics to extrinsic outcomes; (iii) controlled comparisons identifying when character-level and morphology-level tokenization pay off; and (iv) an open-source release of evaluation code, tokenizer pipelines, and models. As the first work of its kind, this "subwords manifest" delivers actionable guidance for building effective tokenizers in MRLs and establishes a reproducible foundation for future research.
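To illustrate one of the toolkit's diagnostics, boundary-level F1 can be computed by comparing the internal split points of a predicted subword segmentation against gold morpheme boundaries. The sketch below is a minimal, hypothetical implementation (the function names and the example split are ours, not from the paper's released toolkit), scoring a single Turkish word:

```python
def boundary_positions(segments):
    """Character offsets of the internal boundaries in a segmentation."""
    positions, offset = set(), 0
    for seg in segments[:-1]:  # the end of the word is not a boundary
        offset += len(seg)
        positions.add(offset)
    return positions

def boundary_f1(predicted, gold):
    """Boundary-level precision, recall, and F1 for one word."""
    pred_b = boundary_positions(predicted)
    gold_b = boundary_positions(gold)
    if not pred_b and not gold_b:  # monomorphemic word, left unsplit
        return 1.0, 1.0, 1.0
    tp = len(pred_b & gold_b)
    precision = tp / len(pred_b) if pred_b else 0.0
    recall = tp / len(gold_b) if gold_b else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Turkish "evlerimde" ("in my houses"): gold morphemes ev-ler-im-de.
gold = ["ev", "ler", "im", "de"]
pred = ["evler", "im", "de"]  # a plausible subword split: under-segmented
p, r, f = boundary_f1(pred, gold)
```

Here the predicted split recovers two of the three gold boundaries and introduces none that are spurious, so precision is 1.0, recall is 2/3, and F1 is 0.8; averaging such scores over words yields macro F1, while pooling boundary counts across the corpus yields micro F1.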