Adapting a Multilingual Embedding Model to Turkish via Cross-Lingual Tokenizer Surgery and Offline Distillation

Abstract

Sentence embeddings are a foundational component of semantic search, clustering, classification, and retrieval-augmented generation. This paper presents embeddingmagibu-200m, a Turkish-focused sentence embedding model that produces 768-dimensional L2-normalized vectors and supports an 8,192-token context window, far exceeding the 512-token limit of earlier BERT-based Turkish encoders. Instead of full pretraining, an efficient three-stage adaptation pipeline is introduced: (1) construct a Turkish-optimized multilingual tokenizer with a 131,072-token vocabulary by pruning redundant tokens from the teacher's vocabulary and incorporating multilingual tokens selected by frequency analysis on a 40-language corpus; (2) clone the teacher embedding model, preserving the transformer backbone weights and initializing a compatible embedding table for the new vocabulary via mean-composition token mapping; and (3) perform offline embedding distillation from precomputed teacher vectors using a cosine similarity objective over a balanced 40-language Wikipedia corpus. The resulting student model has approximately 200M parameters. Because no online teacher inference is needed during training, it trains in roughly four hours on a single GPU at a total cost of $5-20. Empirically, the student obtains Pearson/Spearman correlations of 77.55%/77.45% on STSbTR, surpassing the 300M-parameter teacher model (73.84%/72.92%). On TR-MTEB (26 tasks), it achieves a mean score of 63.9%, ranking 7th of 26 models, and offers a competitive cost-quality trade-off with 33% fewer parameters than the teacher. To facilitate reproducibility and downstream use, all artifacts are released, including model weights, tokenizer files, precomputed embedding datasets, and open-source cloning and distillation tooling.
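
To make stages (2) and (3) concrete, the following is a minimal sketch in pure Python, not the released tooling. It assumes that mean-composition mapping initializes each new token's vector as the average of the teacher embeddings of the pieces the teacher tokenizer splits that token into, and that the cosine objective takes the form 1 - cos(student, teacher); the abstract does not state the exact loss, so that form is an assumption. The names `mean_composition_init`, `cosine_distillation_loss`, `toy_old_tokenize`, and the two-dimensional toy vocabulary are all hypothetical.

```python
import math


def mean_composition_init(new_vocab, old_tokenize, old_embeddings):
    """Initialize one vector per new token by averaging old-vocabulary vectors.

    new_vocab: iterable of token strings in the new tokenizer (hypothetical).
    old_tokenize: callable mapping a string to a list of old-vocabulary pieces.
    old_embeddings: dict from old-vocabulary piece to its embedding (list of floats).
    """
    dim = len(next(iter(old_embeddings.values())))
    table = {}
    for token in new_vocab:
        pieces = [p for p in old_tokenize(token) if p in old_embeddings]
        if not pieces:
            # Assumption: fall back to a zero vector when no piece is known.
            table[token] = [0.0] * dim
            continue
        table[token] = [
            sum(old_embeddings[p][i] for p in pieces) / len(pieces)
            for i in range(dim)
        ]
    return table


def cosine_distillation_loss(student_vec, teacher_vec):
    """Return 1 - cosine similarity (assumed form of the cosine objective)."""
    dot = sum(s * t for s, t in zip(student_vec, teacher_vec))
    norm_s = math.sqrt(sum(s * s for s in student_vec))
    norm_t = math.sqrt(sum(t * t for t in teacher_vec))
    if norm_s == 0.0 or norm_t == 0.0:
        return 1.0  # treat a zero vector as having zero similarity
    return 1.0 - dot / (norm_s * norm_t)


# Toy data: the old tokenizer splits "evler" ("houses") into "ev" + "ler".
old_embeddings = {"ev": [1.0, 0.0], "ler": [0.0, 1.0]}
toy_splits = {"evler": ["ev", "ler"], "ev": ["ev"]}


def toy_old_tokenize(text):
    return toy_splits.get(text, [])


table = mean_composition_init(["evler", "ev"], toy_old_tokenize, old_embeddings)

# In offline distillation the teacher vector would be read from a precomputed
# dataset; here it is a hand-written stand-in.
precomputed_teacher_vec = [1.0, 1.0]
loss = cosine_distillation_loss(table["evler"], precomputed_teacher_vec)
```

In this toy case the new token "evler" starts at the midpoint of its two constituent vectors, which points in the same direction as the stand-in teacher vector, so the loss is numerically close to zero. Reading teacher vectors from storage rather than computing them during training is what removes the teacher's forward pass from the training loop.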