Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning
May 14, 2025
作者: Shaurya Sharthak, Vinayak Pahalwan, Adithya Kamath, Adarsh Shirawalmath
cs.AI
Abstract
Pretrained large language models (LLMs) are often constrained by their fixed
tokenization schemes, leading to inefficiencies and performance limitations,
particularly for multilingual or specialized applications. This tokenizer
lock-in presents significant challenges, and standard methods to overcome it
often require prohibitive computational resources. Although tokenizer
replacement with heuristic initialization aims to reduce this burden, existing
methods often require exhaustive residual fine-tuning and still may not fully
preserve semantic nuances or adequately address the underlying compression
inefficiencies. Our framework introduces two innovations: first, TokenAdapt, a
model-agnostic tokenizer transplantation method, and second, novel
pre-tokenization learning for multi-word Supertokens that enhances compression
and reduces fragmentation. TokenAdapt initializes each new unique token
embedding via a hybrid heuristic that combines two estimates: a local estimate
based on subword decomposition with the old tokenizer, and a global estimate
drawn from the top-k semantically similar tokens in the original vocabulary.
This methodology aims to preserve semantics while significantly reducing
retraining requirements. Empirical investigations validate both contributions:
the transplantation heuristic successfully initializes unique tokens, markedly
outperforming conventional baselines and sophisticated methods including
TransTokenizer and ReTok, while our Supertokens achieve notable compression
gains. Our zero-shot perplexity results show that TokenAdapt's hybrid
initialization consistently yields lower perplexity ratios than both the ReTok
and TransTokenizer baselines across different base models and newly trained
target tokenizers. TokenAdapt typically reduces the overall perplexity ratio
substantially, achieving at least a 2-fold improvement over ReTok in these
aggregate scores.
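The hybrid initialization heuristic described in the abstract, blending a local estimate (averaging old-tokenizer subword embeddings) with a global estimate (a similarity-weighted mean over the top-k most similar old-vocabulary tokens), can be sketched as below. This is a minimal illustration, not the paper's implementation: `decompose` and `sim_fn` are hypothetical helpers standing in for the subword splitter and the semantic-similarity function, and the mixing weight `alpha` is an assumed parameter.

```python
import numpy as np

def local_estimate(token, old_embed, decompose):
    """Local estimate: mean of the old embeddings of the token's
    subword decomposition under the old tokenizer (decompose is assumed)."""
    subwords = decompose(token)
    return np.mean([old_embed[s] for s in subwords], axis=0)

def global_estimate(token, old_embed, sim_fn, k=5):
    """Global estimate: similarity-weighted mean of the top-k most
    semantically similar tokens in the original vocabulary."""
    scored = sorted(((sim_fn(token, t), t) for t in old_embed), reverse=True)[:k]
    weights = np.array([s for s, _ in scored], dtype=float)
    weights /= weights.sum()  # normalize similarities into mixing weights
    return np.sum([w * old_embed[t] for (_, t), w in zip(scored, weights)], axis=0)

def hybrid_init(token, old_embed, decompose, sim_fn, alpha=0.5, k=5):
    """Hybrid heuristic: blend local and global estimates.
    alpha is an assumed mixing weight, not a value from the paper."""
    return (alpha * local_estimate(token, old_embed, decompose)
            + (1 - alpha) * global_estimate(token, old_embed, sim_fn, k=k))
```

In practice the similarity function would come from an auxiliary embedding space rather than the model itself, so that new tokens absent from the old vocabulary can still be compared against it.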