
Training-Free Tokenizer Transplantation via Orthogonal Matching Pursuit

June 7, 2025
Authors: Charles Goddard, Fernando Fernandes Neto
cs.AI

Abstract

We present a training-free method to transplant tokenizers in pretrained large language models (LLMs) by reconstructing unseen token embeddings via Orthogonal Matching Pursuit (OMP). Specifically, we approximate each out-of-vocabulary token as a sparse linear combination of shared tokens, in two phases: first, compute each new token's representation in the donor embedding space with a small dictionary of shared anchor tokens, then transfer these same sparse coefficients back into the base model's embedding space. On two challenging cross-tokenizer tasks, Llama→Mistral NeMo (12B) and Qwen→Llama (1B), we show that OMP achieves the best zero-shot preservation of the base model's performance across multiple benchmarks, while other zero-shot approaches degrade significantly. Compared to baselines (zero-init, mean-init, and existing approaches like WECHSEL, FOCUS, ZETT), OMP consistently achieves the best overall performance, effectively bridging large tokenizer discrepancies without gradient updates. Our analysis further identifies mismatched numerical tokenization schemes as a critical challenge for preserving mathematical reasoning capabilities. This technique enables direct reuse of pretrained model weights with new tokenizers, facilitating cross-tokenizer knowledge distillation, speculative decoding, ensembling, merging, and domain-specific vocabulary adaptation. We integrate our method into the open-source mergekit-tokensurgeon tool for post hoc vocabulary realignment.
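To make the two-phase procedure concrete, below is a minimal NumPy sketch of the approach the abstract describes. The function names (`omp_coefficients`, `transplant_embedding`) and the sparsity budget `k` are illustrative assumptions, not the paper's implementation (see mergekit-tokensurgeon for that); the sketch also assumes the donor and base anchor-embedding matrices list the shared tokens in the same order.

```python
import numpy as np

def omp_coefficients(D, x, k):
    """Greedy Orthogonal Matching Pursuit (illustrative sketch).

    D: (n_anchors, dim) rows are donor-space embeddings of shared anchor tokens.
    x: (dim,) donor-space embedding of an out-of-vocabulary token.
    k: sparsity budget, i.e. how many anchors may be selected.
    Returns (indices, coeffs) such that x ≈ coeffs @ D[indices].
    """
    residual = x.copy()
    indices = []
    coeffs = np.zeros(0)
    for _ in range(k):
        # Select the anchor most correlated with the current residual.
        scores = np.abs(D @ residual)
        scores[indices] = -np.inf  # never reselect an already-chosen anchor
        indices.append(int(np.argmax(scores)))
        # Least-squares refit over all selected anchors: the "orthogonal" step.
        A = D[indices].T                       # (dim, len(indices))
        coeffs, *_ = np.linalg.lstsq(A, x, rcond=None)
        residual = x - A @ coeffs
    return indices, coeffs

def transplant_embedding(x_donor, D_donor, E_base, k=8):
    """Phase 1: sparse-code the new token over shared anchors in donor space.
    Phase 2: reuse the same sparse coefficients over the base model's
    embeddings of those anchors. Rows of D_donor and E_base must correspond
    to the same shared tokens, in the same order."""
    idx, c = omp_coefficients(D_donor, x_donor, k)
    return c @ E_base[idx]                     # (base_dim,) transplanted embedding
```

Applied to every out-of-vocabulary token (tokens shared by both vocabularies keep their base embeddings directly), this yields a complete embedding matrix for the new tokenizer with no gradient updates, which is what makes the transplantation training-free.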