Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning
May 14, 2025
作者: Shaurya Sharthak, Vinayak Pahalwan, Adithya Kamath, Adarsh Shirawalmath
cs.AI
Abstract
Pretrained large language models (LLMs) are often constrained by their fixed
tokenization schemes, leading to inefficiencies and performance limitations,
particularly for multilingual or specialized applications. This tokenizer
lock-in presents significant challenges, and standard methods to overcome it
often require prohibitive computational resources. Although tokenizer
replacement with heuristic initialization aims to reduce this burden, existing
methods often require exhaustive residual fine-tuning and still may not fully
preserve semantic nuances or adequately address the underlying compression
inefficiencies. Our framework introduces two innovations: first, TokenAdapt, a
model-agnostic tokenizer transplantation method, and second, novel
pre-tokenization learning for multi-word Supertokens to enhance compression and
reduce fragmentation. TokenAdapt initializes new unique token embeddings via a
hybrid heuristic that combines two methods: a local estimate based on subword
decomposition using the old tokenizer, and a global estimate utilizing the
top-k semantically similar tokens from the original vocabulary. This
methodology aims to preserve semantics while significantly minimizing
retraining requirements. Empirical investigations validate both contributions:
the transplantation heuristic successfully initializes unique tokens, markedly
outperforming conventional baselines and sophisticated methods including
TransTokenizer and ReTok, while our Supertokens achieve notable compression
gains. Our zero-shot perplexity results demonstrate that the TokenAdapt hybrid
initialization consistently yields lower perplexity ratios compared to both
ReTok and TransTokenizer baselines across different base models and newly
trained target tokenizers. TokenAdapt typically reduced the overall perplexity
ratio significantly compared to ReTok, yielding at least a 2-fold improvement
in these aggregate scores.
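
To make the hybrid heuristic above concrete, the following is a minimal sketch of how such an initialization could look; it is not the authors' released implementation. The helper text_embed (an auxiliary text-embedding model used to score semantic similarity between token strings), the mean-pooling of subword embeddings, and the mixing weight w are illustrative assumptions; only the local-estimate / global top-k structure is taken from the abstract.

# Minimal sketch of the hybrid initialization described in the abstract.
# Assumptions not stated there: mean-pooled subword embeddings, a scalar
# mixing weight `w`, and an auxiliary embedding function `text_embed` for
# scoring semantic similarity between token strings.
import numpy as np

def hybrid_init(new_tok_str, old_tokenizer, old_embeddings, old_vocab_strs,
                text_embed, k=8, w=0.5):
    """Initialize the embedding of a token that is unique to the new vocabulary."""
    # Local estimate: decompose the new token's string with the OLD tokenizer
    # and pool the embeddings of the resulting subword pieces.
    piece_ids = old_tokenizer.encode(new_tok_str, add_special_tokens=False)
    local = old_embeddings[piece_ids].mean(axis=0)

    # Global estimate: find the top-k semantically closest OLD tokens and take
    # a similarity-weighted average of their embeddings. (In practice the
    # candidate embeddings would be precomputed once, not per token.)
    query = text_embed(new_tok_str)
    cand = np.stack([text_embed(s) for s in old_vocab_strs])
    sims = (cand @ query) / (np.linalg.norm(cand, axis=1) * np.linalg.norm(query) + 1e-8)
    top = np.argsort(-sims)[:k]
    weights = sims[top] / sims[top].sum()          # assumes similarities are positive
    global_est = (weights[:, None] * old_embeddings[top]).sum(axis=0)

    # Hybrid: blend the local and global estimates.
    return w * local + (1.0 - w) * global_est

Tokens shared by the old and new vocabularies would presumably be copied over directly; only the new, unique tokens need a routine like this, which is what keeps the residual retraining requirement small.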