Training-Free Tokenizer Transplantation via Orthogonal Matching Pursuit

Abstract

We present a training-free method for transplanting tokenizers in pretrained large language models (LLMs) by reconstructing unseen token embeddings via Orthogonal Matching Pursuit (OMP). Specifically, we approximate each out-of-vocabulary token as a sparse linear combination of shared tokens, in two phases: first, we compute each new token's representation in the donor embedding space using a small dictionary of shared anchor tokens; then we transfer the same sparse coefficients into the base model's embedding space. On two challenging cross-tokenizer tasks, Llama→Mistral NeMo (12B) and Qwen→Llama (1B), we show that OMP achieves the best zero-shot preservation of the base model's performance across multiple benchmarks, while other zero-shot approaches degrade significantly. Compared to baselines (zero-init, mean-init, and existing approaches such as WECHSEL, FOCUS, and ZETT), OMP consistently achieves the best overall performance, effectively bridging large tokenizer discrepancies without gradient updates. Our analysis further identifies mismatched numerical tokenization schemes as a critical challenge for preserving mathematical reasoning capabilities. This technique enables direct reuse of pretrained model weights with new tokenizers, facilitating cross-tokenizer knowledge distillation, speculative decoding, ensembling, merging, and domain-specific vocabulary adaptation. We integrate our method into the open-source mergekit-tokensurgeon tool for post hoc vocabulary realignment.
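The two-phase procedure described in the abstract can be illustrated with a minimal NumPy sketch. This is not the mergekit-tokensurgeon implementation; the function names (omp_coefficients, transplant_embedding), the default sparsity k, and the stopping tolerance are assumptions made for illustration. The sketch assumes the shared anchor tokens are given as two row-aligned matrices, one in the donor embedding space and one in the base embedding space, and it skips refinements such as normalizing the anchors before selection.

```python
import numpy as np

def omp_coefficients(target, dictionary, k):
    """Greedy OMP: approximate `target` (d,) with at most k rows of `dictionary` (n, d).

    Returns the selected row indices and their least-squares coefficients.
    """
    t = np.asarray(target, dtype=np.float64)
    D = np.asarray(dictionary, dtype=np.float64)
    residual = t.copy()
    selected = []
    coeffs = np.zeros(0)
    for _ in range(k):
        # Pick the anchor most correlated with the current residual.
        scores = np.abs(D @ residual)
        if selected:
            scores[selected] = -np.inf  # never pick the same anchor twice
        selected.append(int(np.argmax(scores)))
        # Re-fit all selected anchors jointly (the "orthogonal" step).
        A = D[selected].T  # shape (d, number selected)
        coeffs, *_ = np.linalg.lstsq(A, t, rcond=None)
        residual = t - A @ coeffs
        if np.linalg.norm(residual) < 1e-10:  # assumed tolerance
            break
    return selected, coeffs

def transplant_embedding(donor_token_vec, donor_anchors, base_anchors, k=8):
    """Phase 1: sparse-code the new token over donor-space anchors.
    Phase 2: reuse the same coefficients on the matching base-space anchors.
    """
    idx, coeffs = omp_coefficients(donor_token_vec, donor_anchors, k)
    return np.asarray(base_anchors, dtype=np.float64)[idx].T @ coeffs
```

As a sanity check on the idea: if the base-space anchors were an exact linear image of the donor-space anchors, any token reconstructed with zero residual in the donor space would map to the same linear image in the base space. Real embedding spaces are only approximately related this way, which is why the quality of the transfer has to be measured empirically.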
