zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression
June 1, 2025
Authors: Saibo Geng, Nathan Ranchin, Yunzhen Yao, Maxime Peyrard, Chris Wendler, Michael Gastpar, Robert West
cs.AI
Abstract
Tokenization efficiency plays a critical role in the performance and cost of
large language models (LLMs), yet most models rely on static tokenizers
optimized for general-purpose corpora. These tokenizers' fixed vocabularies
often fail to adapt to domain- or language-specific inputs, leading to longer
token sequences and higher computational costs. We introduce zip2zip, a
framework that enables LLMs to dynamically adjust token vocabulary at inference
time, allowing for fewer generated tokens and thus faster inference. zip2zip
consists of three key components: (1) a tokenizer based on Lempel-Ziv-Welch
(LZW) compression that incrementally compresses tokens into reusable
"hypertokens" on the fly; (2) an embedding layer that computes embeddings for
newly formed hypertokens at runtime; and (3) a causal language modeling variant
that trains the model to operate on hypertokenized, compressed sequences. We
show that an existing LLM can be zip2zip-fied in 10 GPU-hours via
parameter-efficient finetuning. The resulting zip2zip LLMs effectively learn to
use hypertokens at inference time, reducing input and output sequence length by
20-60%, with significant improvements in inference latency.
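
To make the LZW step concrete, the sketch below shows how a stream of token ids could be merged into hypertoken ids on the fly. It is a plain LZW dictionary build over token ids, not the authors' code: the function name lzw_compress, the base_vocab_size and max_hypertokens parameters, and the returned hypertoken_table are hypothetical, and the runtime embedding of hypertokens and the matching decompression step are omitted.

```python
# Minimal, illustrative sketch of LZW-style token compression into "hypertokens".
# Hypertoken ids are assigned past the end of the base vocabulary; the names and
# parameters here are assumptions, not taken from the zip2zip implementation.

def lzw_compress(token_ids, base_vocab_size, max_hypertokens=2048):
    """Return (compressed_ids, hypertoken_table).

    compressed_ids mixes original token ids with new hypertoken ids
    (>= base_vocab_size); hypertoken_table maps each hypertoken id to the
    tuple of base token ids it stands for, so an embedding for it can be
    computed at runtime.
    """
    next_id = base_vocab_size            # first id available for hypertokens
    phrase_to_id = {}                    # multi-token phrase -> hypertoken id
    hypertoken_table = {}                # hypertoken id -> constituent base ids
    compressed = []

    def code_of(phrase):
        # Single tokens keep their original id; longer phrases use hypertoken ids.
        return phrase[0] if len(phrase) == 1 else phrase_to_id[phrase]

    current = ()                         # longest known phrase seen so far
    for tok in token_ids:
        candidate = current + (tok,)
        if len(candidate) == 1 or candidate in phrase_to_id:
            current = candidate          # keep extending a known phrase
        else:
            compressed.append(code_of(current))
            if next_id < base_vocab_size + max_hypertokens:
                phrase_to_id[candidate] = next_id     # mint a new hypertoken
                hypertoken_table[next_id] = candidate
                next_id += 1
            current = (tok,)
    if current:
        compressed.append(code_of(current))
    return compressed, hypertoken_table


# Example: a repetitive 7-token input shrinks to 5 ids, with one hypertoken reused.
ids, table = lzw_compress([17, 42, 17, 42, 17, 42, 9], base_vocab_size=50_000)
# ids   == [17, 42, 50000, 50000, 9]
# table == {50000: (17, 42), 50001: (42, 17), 50002: (17, 42, 17), 50003: (17, 42, 9)}
```

Because the dictionary is rebuilt deterministically from the sequence itself, the same merges can be reproduced on the decoding side, which is what lets the model emit hypertokens and still recover the underlying base-token text.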