zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression
June 1, 2025
Authors: Saibo Geng, Nathan Ranchin, Yunzhen Yao, Maxime Peyrard, Chris Wendler, Michael Gastpar, Robert West
cs.AI
Abstract
Tokenization efficiency plays a critical role in the performance and cost of
large language models (LLMs), yet most models rely on static tokenizers
optimized for general-purpose corpora. These tokenizers' fixed vocabularies
often fail to adapt to domain- or language-specific inputs, leading to longer
token sequences and higher computational costs. We introduce zip2zip, a
framework that enables LLMs to dynamically adjust token vocabulary at inference
time, allowing for fewer generated tokens and thus faster inference. zip2zip
consists of three key components: (1) a tokenizer based on Lempel-Ziv-Welch
(LZW) compression that incrementally compresses tokens into reusable
"hypertokens" on the fly; (2) an embedding layer that computes embeddings for
newly formed hypertokens at runtime; and (3) a causal language modeling variant
that trains the model to operate on hypertokenized, compressed sequences. We
show that an existing LLM can be zip2zip-fied in 10 GPU-hours via
parameter-efficient finetuning. The resulting zip2zip LLMs effectively learn to
use hypertokens at inference time, reducing input and output sequence length by
20-60%, with significant improvements in inference latency.
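
To make the LZW step concrete, the sketch below shows how a stream of token ids could be merged into hypertoken ids on the fly. It is a plain LZW dictionary build over token ids, not the authors' code: the function name lzw_compress, the base_vocab_size and max_hypertokens parameters, and the returned hypertoken_table are hypothetical, and the runtime embedding of hypertokens and the matching decompression step are omitted.

```python
# Minimal, illustrative sketch of LZW-style token compression into "hypertokens".
# Hypertoken ids are assigned past the end of the base vocabulary; the names and
# parameters here are assumptions, not taken from the zip2zip implementation.

def lzw_compress(token_ids, base_vocab_size, max_hypertokens=2048):
    """Return (compressed_ids, hypertoken_table).

    compressed_ids mixes original token ids with new hypertoken ids
    (>= base_vocab_size); hypertoken_table maps each hypertoken id to the
    tuple of base token ids it stands for, so an embedding for it can be
    computed at runtime.
    """
    next_id = base_vocab_size            # first id available for hypertokens
    phrase_to_id = {}                    # multi-token phrase -> hypertoken id
    hypertoken_table = {}                # hypertoken id -> constituent base ids
    compressed = []

    def code_of(phrase):
        # Single tokens keep their original id; longer phrases use hypertoken ids.
        return phrase[0] if len(phrase) == 1 else phrase_to_id[phrase]

    current = ()                         # longest known phrase seen so far
    for tok in token_ids:
        candidate = current + (tok,)
        if len(candidate) == 1 or candidate in phrase_to_id:
            current = candidate          # keep extending a known phrase
        else:
            compressed.append(code_of(current))
            if next_id < base_vocab_size + max_hypertokens:
                phrase_to_id[candidate] = next_id     # mint a new hypertoken
                hypertoken_table[next_id] = candidate
                next_id += 1
            current = (tok,)
    if current:
        compressed.append(code_of(current))
    return compressed, hypertoken_table


# Example: a repetitive 7-token input shrinks to 5 ids, with one hypertoken reused.
ids, table = lzw_compress([17, 42, 17, 42, 17, 42, 9], base_vocab_size=50_000)
# ids   == [17, 42, 50000, 50000, 9]
# table == {50000: (17, 42), 50001: (42, 17), 50002: (17, 42, 17), 50003: (17, 42, 9)}
```

Because the dictionary is rebuilt deterministically from the sequence itself, the same merges can be reproduced on the decoding side, which is what lets the model emit hypertokens and still recover the underlying base-token text.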