Turkish Adaptation of a Multilingual Embedding Model via Cross-Lingual Tokenizer Surgery and Offline Distillation

Abstract

Sentence embeddings are a foundational component for semantic search, clustering, classification, and retrieval-augmented generation. This paper presents embeddingmagibu-200m, a Turkish-focused sentence embedding model that produces 768-dimensional L2-normalized vectors and supports an 8,192-token context window, far exceeding the 512-token limit of earlier BERT-based Turkish encoders. Instead of full pretraining, an efficient three-stage adaptation pipeline is introduced: (1) construct a Turkish-optimized multilingual tokenizer with a 131,072-token vocabulary by pruning redundant tokens from the teacher's vocabulary and incorporating multilingual tokens selected via frequency analysis on a 40-language corpus; (2) clone the teacher embedding model, preserving the transformer backbone weights and initializing a compatible embedding table for the new vocabulary via mean-composition token mapping; and (3) perform offline embedding distillation from precomputed teacher vectors using a cosine similarity objective over a balanced 40-language Wikipedia corpus. The resulting student model contains approximately 200M parameters. Because no online teacher inference is needed during training, it trains in roughly four hours on a single GPU at a total cost of $5 to $20. Empirically, the student obtains Pearson/Spearman correlations of 77.55%/77.45% on STSbTR, surpassing the 300M-parameter teacher model (73.84%/72.92%). On TR-MTEB (26 tasks), it achieves a mean score of 63.9% (7th out of 26 models), providing a competitive cost-quality trade-off with 33% fewer parameters than the teacher. To facilitate reproducibility and downstream use, all artifacts are released, including model weights, tokenizer files, precomputed embedding datasets, and open-source cloning and distillation tooling.
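
Stages (2) and (3) can be illustrated with a minimal NumPy sketch. It is not the released tooling; it only shows one plausible reading of the two ideas. It assumes that "mean-composition token mapping" means each new token's embedding is the average of the old embeddings of the pieces the old tokenizer splits it into, and that the cosine objective is 1 minus the cosine similarity between student and precomputed teacher vectors. All function names, the toy vocabulary, and the greedy toy tokenizer are hypothetical.

```python
import numpy as np

def init_embeddings_by_mean_composition(new_vocab, old_vocab, old_emb, old_tokenize):
    """Build an embedding table for new_vocab from an existing table (assumed scheme).

    new_vocab:    list of token strings; list index = new token id
    old_vocab:    dict mapping token string -> row index in old_emb
    old_emb:      array of shape (len(old_vocab), dim)
    old_tokenize: callable splitting a string into old-vocabulary pieces
    """
    dim = old_emb.shape[1]
    new_emb = np.zeros((len(new_vocab), dim), dtype=old_emb.dtype)
    fallback = old_emb.mean(axis=0)  # used when no constituent piece is known
    for new_id, token in enumerate(new_vocab):
        if token in old_vocab:
            # Token exists in both vocabularies: copy its row unchanged.
            new_emb[new_id] = old_emb[old_vocab[token]]
            continue
        ids = [old_vocab[p] for p in old_tokenize(token) if p in old_vocab]
        new_emb[new_id] = old_emb[ids].mean(axis=0) if ids else fallback
    return new_emb

def cosine_distillation_loss(student, teacher, eps=1e-8):
    """Mean of (1 - cosine similarity) over a batch of row vectors."""
    s = student / (np.linalg.norm(student, axis=1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

# --- Toy demonstration with a hypothetical 3-token old vocabulary ---
old_vocab = {"ev": 0, "ler": 1, "de": 2}
old_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def toy_old_tokenize(text):
    """Greedy longest-match split; characters with no match are skipped."""
    pieces, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in old_vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            i += 1
    return pieces

new_vocab = ["ev", "evler", "evlerde", "xyz"]
new_emb = init_embeddings_by_mean_composition(
    new_vocab, old_vocab, old_emb, toy_old_tokenize
)

# Offline distillation: teacher vectors are computed once and stored,
# so training only compares student outputs against the stored array.
teacher_vectors = np.array([[1.0, 0.0], [0.0, 1.0]])
loss_identical = cosine_distillation_loss(teacher_vectors, teacher_vectors)
loss_orthogonal = cosine_distillation_loss(teacher_vectors, teacher_vectors[::-1])
```

In this toy setup the new token "evler" receives the average of the "ev" and "ler" rows, while "xyz", which has no known pieces, falls back to the mean of the whole old table. The loss is close to 0 when student and teacher vectors coincide and close to 1 when they are orthogonal.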