Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning
May 14, 2025
作者: Shaurya Sharthak, Vinayak Pahalwan, Adithya Kamath, Adarsh Shirawalmath
cs.AI
Abstract
Pretrained large language models (LLMs) are often constrained by their fixed
tokenization schemes, leading to inefficiencies and performance limitations,
particularly for multilingual or specialized applications. This tokenizer
lock-in presents significant challenges, and standard methods to overcome it
often require prohibitive computational resources. Although tokenizer
replacement with heuristic initialization aims to reduce this burden, existing
methods often require exhaustive residual fine-tuning and still may not fully
preserve semantic nuances or adequately address the underlying compression
inefficiencies. Our framework introduces two innovations: first, TokenAdapt, a
model-agnostic tokenizer transplantation method, and second, novel
pre-tokenization learning for multi-word Supertokens to enhance compression and
reduce fragmentation. TokenAdapt initializes new unique token embeddings via a
hybrid heuristic that combines two methods: a local estimate based on subword
decomposition using the old tokenizer, and a global estimate utilizing the
top-k semantically similar tokens from the original vocabulary. This
methodology aims to preserve semantics while significantly minimizing
retraining requirements. Empirical investigations validate both contributions:
the transplantation heuristic successfully initializes unique tokens, markedly
outperforming conventional baselines and sophisticated methods including
TransTokenizer and ReTok, while our Supertokens achieve notable compression
gains. Our zero-shot perplexity results demonstrate that the TokenAdapt hybrid
initialization consistently yields lower perplexity ratios compared to both
ReTok and TransTokenizer baselines across different base models and newly
trained target tokenizers. TokenAdapt typically reduced the overall perplexity
ratio significantly compared to ReTok, yielding at least a 2-fold improvement
in these aggregate scores.
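
To make the hybrid heuristic above concrete, the following is a minimal sketch of how such an initialization could look; it is not the authors' released implementation. The helper text_embed (an auxiliary text-embedding model used to score semantic similarity between token strings), the mean-pooling of subword embeddings, and the mixing weight w are illustrative assumptions; only the local-estimate / global top-k structure is taken from the abstract.

# Minimal sketch of the hybrid initialization described in the abstract.
# Assumptions not stated there: mean-pooled subword embeddings, a scalar
# mixing weight `w`, and an auxiliary embedding function `text_embed` for
# scoring semantic similarity between token strings.
import numpy as np

def hybrid_init(new_tok_str, old_tokenizer, old_embeddings, old_vocab_strs,
                text_embed, k=8, w=0.5):
    """Initialize the embedding of a token that is unique to the new vocabulary."""
    # Local estimate: decompose the new token's string with the OLD tokenizer
    # and pool the embeddings of the resulting subword pieces.
    piece_ids = old_tokenizer.encode(new_tok_str, add_special_tokens=False)
    local = old_embeddings[piece_ids].mean(axis=0)

    # Global estimate: find the top-k semantically closest OLD tokens and take
    # a similarity-weighted average of their embeddings. (In practice the
    # candidate embeddings would be precomputed once, not per token.)
    query = text_embed(new_tok_str)
    cand = np.stack([text_embed(s) for s in old_vocab_strs])
    sims = (cand @ query) / (np.linalg.norm(cand, axis=1) * np.linalg.norm(query) + 1e-8)
    top = np.argsort(-sims)[:k]
    weights = sims[top] / sims[top].sum()          # assumes similarities are positive
    global_est = (weights[:, None] * old_embeddings[top]).sum(axis=0)

    # Hybrid: blend the local and global estimates.
    return w * local + (1.0 - w) * global_est

Tokens shared by the old and new vocabularies would presumably be copied over directly; only the new, unique tokens need a routine like this, which is what keeps the residual retraining requirement small.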