zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression
June 1, 2025
Authors: Saibo Geng, Nathan Ranchin, Yunzhen Yao, Maxime Peyrard, Chris Wendler, Michael Gastpar, Robert West
cs.AI
Abstract
Tokenization efficiency plays a critical role in the performance and cost of large language models (LLMs), yet most models rely on static tokenizers optimized for general-purpose corpora. These tokenizers' fixed vocabularies often fail to adapt to domain- or language-specific inputs, leading to longer token sequences and higher computational costs. We introduce zip2zip, a framework that enables LLMs to dynamically adjust their token vocabulary at inference time, allowing for fewer generated tokens and thus faster inference. zip2zip consists of three key components: (1) a tokenizer based on Lempel-Ziv-Welch (LZW) compression that incrementally compresses tokens into reusable "hypertokens" on the fly; (2) an embedding layer that computes embeddings for newly formed hypertokens at runtime; and (3) a causal language modeling variant that trains the model to operate on hypertokenized, compressed sequences. We show that an existing LLM can be zip2zip-fied in 10 GPU-hours via parameter-efficient finetuning. The resulting zip2zip LLMs effectively learn to use hypertokens at inference time, reducing input and output sequence length by 20-60%, with significant improvements in inference latency.
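To make the hypertokenization idea concrete, here is a minimal illustrative sketch, not the authors' implementation, of LZW-style compression over a sequence of base token IDs: known multi-token phrases are greedily extended, and each newly seen phrase is registered as a fresh "hypertoken" ID above the base vocabulary. The function name lzw_hypertokenize and its parameters are hypothetical.

```python
# Illustrative sketch (assumed, not from the paper): LZW-style compression
# of a base-token-ID sequence into a mix of base tokens and hypertokens.

def lzw_hypertokenize(token_ids, base_vocab_size, max_hypertokens=1024):
    """Compress a list of base token IDs; return (compressed IDs, phrase table)."""
    # Phrase table maps a tuple of base token IDs to a (hyper)token ID.
    # Each single base token maps to itself.
    phrase_to_id = {(t,): t for t in range(base_vocab_size)}
    next_id = base_vocab_size          # first available hypertoken ID
    output = []
    current = ()                       # phrase currently being extended

    for tok in token_ids:
        candidate = current + (tok,)
        if candidate in phrase_to_id:
            current = candidate        # keep extending a known phrase
        else:
            output.append(phrase_to_id[current])
            if next_id < base_vocab_size + max_hypertokens:
                phrase_to_id[candidate] = next_id   # register new hypertoken
                next_id += 1
            current = (tok,)           # restart from the token that broke the phrase
    if current:
        output.append(phrase_to_id[current])
    return output, phrase_to_id


if __name__ == "__main__":
    # Toy example: a repetitive token stream compresses once (5, 6) becomes a hypertoken.
    ids = [5, 6, 5, 6, 5, 6, 7]
    compressed, table = lzw_hypertokenize(ids, base_vocab_size=100)
    print(compressed)   # e.g. [5, 6, 100, 100, 7] -- shorter than the 7-token input
```

Per the abstract, each such hypertoken would then receive an embedding computed at runtime (from its constituent base tokens), so the model can operate directly on the compressed sequence.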