
zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression

June 1, 2025
Authors: Saibo Geng, Nathan Ranchin, Yunzhen Yao, Maxime Peyrard, Chris Wendler, Michael Gastpar, Robert West
cs.AI

Abstract

Tokenization efficiency plays a critical role in the performance and cost of large language models (LLMs), yet most models rely on static tokenizers optimized for general-purpose corpora. These tokenizers' fixed vocabularies often fail to adapt to domain- or language-specific inputs, leading to longer token sequences and higher computational costs. We introduce zip2zip, a framework that enables LLMs to dynamically adjust token vocabulary at inference time, allowing for fewer generated tokens and thus faster inference. zip2zip consists of three key components: (1) a tokenizer based on Lempel-Ziv-Welch (LZW) compression that incrementally compresses tokens into reusable "hypertokens" on the fly; (2) an embedding layer that computes embeddings for newly formed hypertokens at runtime; and (3) a causal language modeling variant that trains the model to operate on hypertokenized, compressed sequences. We show that an existing LLM can be zip2zip-fied in 10 GPU-hours via parameter-efficient finetuning. The resulting zip2zip LLMs effectively learn to use hypertokens at inference time, reducing input and output sequence length by 20-60%, with significant improvements in inference latency.
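
To make the compression step concrete, the following is a minimal Python sketch of LZW-style token compression in the spirit the abstract describes; it is not the paper's implementation. The function name lzw_compress, the vocabulary-size and hypertoken-budget constants, and the toy token ids are all illustrative assumptions.

```python
# Minimal sketch (assumed, not the authors' code) of LZW-style token compression:
# repeated token sequences are replaced by new "hypertoken" ids whose dictionary
# is built incrementally at runtime, starting from the base vocabulary.

from typing import Dict, List, Tuple

BASE_VOCAB_SIZE = 32_000   # assumed size of the base tokenizer vocabulary
MAX_HYPERTOKENS = 4_096    # assumed budget for on-the-fly hypertokens


def lzw_compress(token_ids: List[int]) -> Tuple[List[int], Dict[Tuple[int, ...], int]]:
    """Compress a token-id sequence with classic LZW, returning the compressed
    sequence and the hypertoken codebook built along the way."""
    codebook: Dict[Tuple[int, ...], int] = {}   # phrase -> hypertoken id
    next_id = BASE_VOCAB_SIZE
    output: List[int] = []

    phrase: Tuple[int, ...] = ()
    for tok in token_ids:
        candidate = phrase + (tok,)
        # Single base tokens are always "known"; longer phrases must be in the codebook.
        if len(candidate) == 1 or candidate in codebook:
            phrase = candidate
            continue
        # Emit the longest known phrase (a base token or a hypertoken)...
        output.append(phrase[0] if len(phrase) == 1 else codebook[phrase])
        # ...and register the extended phrase as a new hypertoken, budget permitting.
        if next_id < BASE_VOCAB_SIZE + MAX_HYPERTOKENS:
            codebook[candidate] = next_id
            next_id += 1
        phrase = (tok,)
    if phrase:
        output.append(phrase[0] if len(phrase) == 1 else codebook[phrase])
    return output, codebook


if __name__ == "__main__":
    # Repetitive inputs (e.g. domain-specific text) compress well:
    ids = [5, 17, 9, 5, 17, 9, 5, 17, 9, 5, 17]
    compressed, codebook = lzw_compress(ids)
    print(len(ids), "->", len(compressed), "tokens")   # 11 -> 7
    print("hypertokens:", codebook)
```

In the full framework, each newly created hypertoken would additionally need an embedding computed at runtime (for example, pooled from the embeddings of its constituent tokens), and the model is finetuned to predict over these compressed sequences; the sketch above only illustrates how the hypertoken dictionary grows on the fly.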