Adapting a Multilingual Embedding Model to Turkish via Cross-Lingual Tokenizer Surgery and Offline Distillation

Abstract

Sentence embeddings are a foundational component of semantic search, clustering, classification, and retrieval-augmented generation. This paper presents embeddingmagibu-200m, a Turkish-focused sentence embedding model that produces 768-dimensional L2-normalized vectors and supports an 8,192-token context window, far exceeding the 512-token limit of earlier BERT-based Turkish encoders. Instead of full pretraining, an efficient three-stage adaptation pipeline is introduced: (1) a Turkish-optimized multilingual tokenizer with a 131,072-token vocabulary is constructed by pruning redundant tokens from the teacher tokenizer's vocabulary and adding multilingual tokens selected by frequency analysis on a 40-language corpus; (2) the teacher embedding model is cloned with its transformer backbone weights left unchanged, and a compatible embedding table for the new vocabulary is initialized via mean-composition token mapping; and (3) offline embedding distillation is performed from precomputed teacher vectors with a cosine similarity objective over a balanced 40-language Wikipedia corpus. The resulting student model contains approximately 200M parameters. Because no online teacher inference is needed during training, it trains in roughly four hours on a single GPU, at a total cost of $5-20. Empirically, the model obtains Pearson/Spearman correlations of 77.55%/77.45% on STSbTR, surpassing the 300M-parameter teacher model (73.84%/72.92%). On TR-MTEB (26 tasks), it achieves a mean score of 63.9% (7th out of 26 models), providing a competitive cost-quality trade-off with 33% fewer parameters than the teacher. To facilitate reproducibility and downstream use, all artifacts are released, including model weights, tokenizer files, precomputed embedding datasets, and open-source cloning and distillation tooling.
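
Stages (2) and (3) of the pipeline can be illustrated with a minimal NumPy sketch. It is not the released tooling: the function names `init_student_embeddings` and `cosine_distillation_loss`, the toy arrays, and the exact loss form (one minus cosine similarity, averaged over the batch) are assumptions made for illustration. The sketch assumes that each token in the new vocabulary has already been decomposed into one or more teacher-token IDs, and that teacher sentence vectors were precomputed and stored ahead of training.

```python
import numpy as np


def init_student_embeddings(teacher_emb, pieces_per_token):
    """Mean-composition initialization (illustrative sketch).

    teacher_emb:      (teacher_vocab, dim) teacher embedding table.
    pieces_per_token: for each new-vocabulary token, the list of teacher
                      token IDs it decomposes into (hypothetical mapping).
    Returns a (new_vocab, dim) table in which every row is the mean of the
    teacher rows listed for that token.
    """
    dim = teacher_emb.shape[1]
    student_emb = np.zeros((len(pieces_per_token), dim), dtype=teacher_emb.dtype)
    for new_id, teacher_ids in enumerate(pieces_per_token):
        if len(teacher_ids) == 0:
            raise ValueError(f"token {new_id} has no teacher pieces")
        student_emb[new_id] = teacher_emb[teacher_ids].mean(axis=0)
    return student_emb


def cosine_distillation_loss(student_vecs, teacher_vecs, eps=1e-12):
    """Mean of (1 - cosine similarity) between paired sentence vectors.

    teacher_vecs are assumed to be precomputed offline, so no teacher
    forward pass is required when this loss is evaluated.
    """
    s = student_vecs / (np.linalg.norm(student_vecs, axis=1, keepdims=True) + eps)
    t = teacher_vecs / (np.linalg.norm(teacher_vecs, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))


# Toy example: a 3-token teacher table and a 3-token new vocabulary.
toy_teacher_emb = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
toy_mapping = [[0], [0, 1], [2]]  # the second new token spans two teacher tokens
toy_student_emb = init_student_embeddings(toy_teacher_emb, toy_mapping)
# toy_student_emb[1] is the mean of teacher rows 0 and 1, i.e. [0.5, 0.5]
```

In this sketch, a new token that maps to a single teacher token inherits that token's embedding unchanged, so the student starts close to the teacher wherever the two vocabularies overlap. In an actual training loop the loss would be computed by a differentiable framework rather than NumPy.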