zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression
June 1, 2025
Authors: Saibo Geng, Nathan Ranchin, Yunzhen Yao, Maxime Peyrard, Chris Wendler, Michael Gastpar, Robert West
cs.AI
Abstract
Tokenization efficiency plays a critical role in the performance and cost of large language models (LLMs), yet most models rely on static tokenizers optimized for general-purpose corpora. These tokenizers' fixed vocabularies often fail to adapt to domain- or language-specific inputs, leading to longer token sequences and higher computational costs. We introduce zip2zip, a framework that enables LLMs to dynamically adjust their token vocabulary at inference time, allowing for fewer generated tokens and thus faster inference. zip2zip consists of three key components: (1) a tokenizer based on Lempel-Ziv-Welch (LZW) compression that incrementally compresses tokens into reusable "hypertokens" on the fly; (2) an embedding layer that computes embeddings for newly formed hypertokens at runtime; and (3) a causal language modeling variant that trains the model to operate on hypertokenized, compressed sequences. We show that an existing LLM can be zip2zip-fied in 10 GPU-hours via parameter-efficient finetuning. The resulting zip2zip LLMs effectively learn to use hypertokens at inference time, reducing input and output sequence length by 20-60%, with significant improvements in inference latency.
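To make the hypertokenization idea concrete, here is a minimal illustrative sketch, not the authors' implementation, of LZW-style compression over a sequence of base token IDs: known multi-token phrases are greedily extended, and each newly seen phrase is registered as a fresh "hypertoken" ID above the base vocabulary. The function name lzw_hypertokenize and its parameters are hypothetical.

```python
# Illustrative sketch (assumed, not from the paper): LZW-style compression
# of a base-token-ID sequence into a mix of base tokens and hypertokens.

def lzw_hypertokenize(token_ids, base_vocab_size, max_hypertokens=1024):
    """Compress a list of base token IDs; return (compressed IDs, phrase table)."""
    # Phrase table maps a tuple of base token IDs to a (hyper)token ID.
    # Each single base token maps to itself.
    phrase_to_id = {(t,): t for t in range(base_vocab_size)}
    next_id = base_vocab_size          # first available hypertoken ID
    output = []
    current = ()                       # phrase currently being extended

    for tok in token_ids:
        candidate = current + (tok,)
        if candidate in phrase_to_id:
            current = candidate        # keep extending a known phrase
        else:
            output.append(phrase_to_id[current])
            if next_id < base_vocab_size + max_hypertokens:
                phrase_to_id[candidate] = next_id   # register new hypertoken
                next_id += 1
            current = (tok,)           # restart from the token that broke the phrase
    if current:
        output.append(phrase_to_id[current])
    return output, phrase_to_id


if __name__ == "__main__":
    # Toy example: a repetitive token stream compresses once (5, 6) becomes a hypertoken.
    ids = [5, 6, 5, 6, 5, 6, 7]
    compressed, table = lzw_hypertokenize(ids, base_vocab_size=100)
    print(compressed)   # e.g. [5, 6, 100, 100, 7] -- shorter than the 7-token input
```

Per the abstract, each such hypertoken would then receive an embedding computed at runtime (from its constituent base tokens), so the model can operate directly on the compressed sequence.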