Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning
May 14, 2025
作者: Shaurya Sharthak, Vinayak Pahalwan, Adithya Kamath, Adarsh Shirawalmath
cs.AI
Abstract
Pretrained large language models (LLMs) are often constrained by their fixed
tokenization schemes, leading to inefficiencies and performance limitations,
particularly for multilingual or specialized applications. This tokenizer
lock-in presents significant challenges, and standard methods to overcome it
often require prohibitive computational resources. Although tokenizer
replacement with heuristic initialization aims to reduce this burden, existing
methods often require exhaustive residual fine-tuning and still may not fully
preserve semantic nuances or adequately address the underlying compression
inefficiencies. Our framework introduces two innovations: first, TokenAdapt, a
model-agnostic tokenizer transplantation method, and second, novel
pre-tokenization learning for multi-word Supertokens that enhances compression
and reduces fragmentation. TokenAdapt initializes each new unique token
embedding via a hybrid heuristic that combines two estimates: a local estimate
based on subword decomposition with the old tokenizer, and a global estimate
drawn from the top-k semantically similar tokens in the original vocabulary.
This methodology aims to preserve semantics while significantly reducing
retraining requirements. Empirical investigations validate both contributions:
the transplantation heuristic successfully initializes unique tokens, markedly
outperforming conventional baselines and sophisticated methods including
TransTokenizer and ReTok, while our Supertokens achieve notable compression
gains. Our zero-shot perplexity results show that TokenAdapt's hybrid
initialization consistently yields lower perplexity ratios than both the ReTok
and TransTokenizer baselines across different base models and newly trained
target tokenizers. TokenAdapt typically reduces the overall perplexity ratio
substantially, achieving at least a 2-fold improvement over ReTok in these
aggregate scores.
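The hybrid initialization heuristic described in the abstract, blending a local estimate (averaging old-tokenizer subword embeddings) with a global estimate (a similarity-weighted mean over the top-k most similar old-vocabulary tokens), can be sketched as below. This is a minimal illustration, not the paper's implementation: `decompose` and `sim_fn` are hypothetical helpers standing in for the subword splitter and the semantic-similarity function, and the mixing weight `alpha` is an assumed parameter.

```python
import numpy as np

def local_estimate(token, old_embed, decompose):
    """Local estimate: mean of the old embeddings of the token's
    subword decomposition under the old tokenizer (decompose is assumed)."""
    subwords = decompose(token)
    return np.mean([old_embed[s] for s in subwords], axis=0)

def global_estimate(token, old_embed, sim_fn, k=5):
    """Global estimate: similarity-weighted mean of the top-k most
    semantically similar tokens in the original vocabulary."""
    scored = sorted(((sim_fn(token, t), t) for t in old_embed), reverse=True)[:k]
    weights = np.array([s for s, _ in scored], dtype=float)
    weights /= weights.sum()  # normalize similarities into mixing weights
    return np.sum([w * old_embed[t] for (_, t), w in zip(scored, weights)], axis=0)

def hybrid_init(token, old_embed, decompose, sim_fn, alpha=0.5, k=5):
    """Hybrid heuristic: blend local and global estimates.
    alpha is an assumed mixing weight, not a value from the paper."""
    return (alpha * local_estimate(token, old_embed, decompose)
            + (1 - alpha) * global_estimate(token, old_embed, sim_fn, k=k))
```

In practice the similarity function would come from an auxiliary embedding space rather than the model itself, so that new tokens absent from the old vocabulary can still be compared against it.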