Training-Free Tokenizer Transplantation via Orthogonal Matching Pursuit
June 7, 2025
Authors: Charles Goddard, Fernando Fernandes Neto
cs.AI
Abstract
We present a training-free method to transplant tokenizers in pretrained
large language models (LLMs) by reconstructing unseen token embeddings via
Orthogonal Matching Pursuit (OMP). Specifically, we approximate each
out-of-vocabulary token as a sparse linear combination of shared tokens, in two
phases: first, compute each new token's representation in the donor embedding
space with a small dictionary of shared anchor tokens; second, apply these same
sparse coefficients to the corresponding anchor embeddings in the base model's
embedding space.
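
The two phases can be sketched in a few lines of Python. This is a minimal
sketch only, assuming scikit-learn's OrthogonalMatchingPursuit as the sparse
solver; the array names and the sparsity level k are illustrative placeholders,
not the paper's actual interface or hyperparameters.

```python
# Minimal sketch of the two-phase transplant, assuming scikit-learn's OMP solver.
# `donor_anchors` / `base_anchors` are placeholder names for the embeddings of
# tokens shared by both vocabularies (the anchor dictionary).
from sklearn.linear_model import OrthogonalMatchingPursuit

def transplant_embedding(donor_vec, donor_anchors, base_anchors, k=8):
    """Reconstruct one out-of-vocabulary token embedding.

    donor_vec:     (d_donor,)          the new token's embedding in the donor model
    donor_anchors: (n_shared, d_donor) donor embeddings of shared anchor tokens
    base_anchors:  (n_shared, d_base)  base-model embeddings of the same tokens
    k:             sparsity budget (number of anchors used); illustrative default
    """
    # Phase 1: sparse-code the donor vector over the anchor dictionary.
    # Dictionary columns are anchor embeddings, so X has shape (d_donor, n_shared).
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False)
    omp.fit(donor_anchors.T, donor_vec)
    coefs = omp.coef_  # (n_shared,), at most k nonzero entries

    # Phase 2: reuse the same sparse coefficients over the base-model anchors.
    return base_anchors.T @ coefs  # (d_base,)
```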
On two challenging cross-tokenizer tasks, Llama → Mistral NeMo (12B) and
Qwen → Llama (1B), we show that OMP achieves the best zero-shot preservation of
the base model's performance across multiple benchmarks, while other zero-shot
approaches degrade significantly. Compared to baselines (zero-init, mean-init,
and existing approaches like WECHSEL, FOCUS, ZETT), OMP consistently achieves
the best overall performance, effectively bridging large tokenizer
discrepancies without gradient updates. Our analysis further identifies
mismatched numerical tokenization schemes as a critical challenge for
preserving mathematical reasoning capabilities. This technique enables direct
reuse of pretrained model weights with new tokenizers, facilitating
cross-tokenizer knowledge distillation, speculative decoding, ensembling,
merging, and domain-specific vocabulary adaptations. We integrate our method
into the open-source mergekit-tokensurgeon tool for post hoc vocabulary
realignment.
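
To make the workflow concrete, here is a hedged sketch of how such a post hoc
realignment could be driven end to end, looping the transplant_embedding sketch
above over every token missing from the base vocabulary. The function and
argument names are hypothetical stand-ins, not mergekit-tokensurgeon's actual
API, and for simplicity it uses all shared tokens as anchors rather than a
small dictionary.

```python
# Hypothetical end-to-end driver built on transplant_embedding() above.
# The data structures are stand-ins, not mergekit-tokensurgeon's API; for
# simplicity every shared token serves as an anchor, whereas the method
# described above restricts the dictionary to a small set of anchor tokens.
import numpy as np

def realign_vocabulary(base_emb, donor_emb, donor_vocab, k=8):
    """Build an embedding matrix for the donor tokenizer's vocabulary.

    base_emb / donor_emb: dicts mapping token string -> np.ndarray embedding
    donor_vocab:          dict mapping token string -> row index (tokenizer vocab)
    """
    shared = sorted(set(base_emb) & set(donor_emb))
    donor_anchors = np.stack([donor_emb[t] for t in shared])  # (n_shared, d_donor)
    base_anchors = np.stack([base_emb[t] for t in shared])    # (n_shared, d_base)

    new_matrix = np.zeros((len(donor_vocab), base_anchors.shape[1]))
    for tok, idx in donor_vocab.items():
        if tok in base_emb:
            # Shared tokens keep their pretrained base-model embedding.
            new_matrix[idx] = base_emb[tok]
        else:
            # Out-of-vocabulary tokens get the OMP reconstruction.
            new_matrix[idx] = transplant_embedding(
                donor_emb[tok], donor_anchors, base_anchors, k=k
            )
    return new_matrix
```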