

zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression

June 1, 2025
Authors: Saibo Geng, Nathan Ranchin, Yunzhen Yao, Maxime Peyrard, Chris Wendler, Michael Gastpar, Robert West
cs.AI

Abstract

Tokenization efficiency plays a critical role in the performance and cost of large language models (LLMs), yet most models rely on static tokenizers optimized for general-purpose corpora. These tokenizers' fixed vocabularies often fail to adapt to domain- or language-specific inputs, leading to longer token sequences and higher computational costs. We introduce zip2zip, a framework that enables LLMs to dynamically adjust token vocabulary at inference time, allowing for fewer generated tokens and thus faster inference. zip2zip consists of three key components: (1) a tokenizer based on Lempel-Ziv-Welch (LZW) compression that incrementally compresses tokens into reusable "hypertokens" on the fly; (2) an embedding layer that computes embeddings for newly formed hypertokens at runtime; and (3) a causal language modeling variant that trains the model to operate on hypertokenized, compressed sequences. We show that an existing LLM can be zip2zip-fied in 10 GPU-hours via parameter-efficient finetuning. The resulting zip2zip LLMs effectively learn to use hypertokens at inference time, reducing input and output sequence length by 20-60%, with significant improvements in inference latency.
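
To make component (1) concrete, below is a minimal sketch of an LZW-style pass over a token-ID sequence: repeated multi-token phrases are assigned new "hypertoken" IDs above the base vocabulary, so later occurrences compress to a single ID. The function and parameter names (`lzw_compress`, `base_vocab_size`, `max_hypertokens`) are illustrative assumptions, not names from the zip2zip codebase.

```python
def lzw_compress(token_ids, base_vocab_size, max_hypertokens=None):
    """Greedy LZW pass over a sequence of base-tokenizer IDs.

    Multi-token phrases seen during the scan are registered as hypertokens
    with IDs starting at `base_vocab_size`; repeated phrases are then emitted
    as a single compressed ID. This is a hypothetical sketch of the idea
    described in the abstract, not the reference implementation.
    """
    phrase_to_id = {}          # tuple of base-token IDs -> hypertoken ID
    next_id = base_vocab_size
    output = []
    current = ()               # longest known phrase matched so far

    for tok in token_ids:
        candidate = current + (tok,)
        if len(candidate) == 1 or candidate in phrase_to_id:
            current = candidate  # keep extending the current phrase
        else:
            # Emit the longest known phrase, then register the new extension.
            output.append(current[0] if len(current) == 1 else phrase_to_id[current])
            if max_hypertokens is None or next_id < base_vocab_size + max_hypertokens:
                phrase_to_id[candidate] = next_id
                next_id += 1
            current = (tok,)

    if current:
        output.append(current[0] if len(current) == 1 else phrase_to_id[current])
    return output, phrase_to_id


# Example: with a 32k base vocabulary, the repeated pair (5, 9) compresses.
ids, table = lzw_compress([5, 9, 5, 9, 5, 9], base_vocab_size=32000)
# ids   -> [5, 9, 32000, 32000]
# table -> {(5, 9): 32000, (9, 5): 32001, (5, 9, 5): 32002}
```

Any ID at or above `base_vocab_size` then needs an embedding computed at runtime, per component (2); one simple (assumed, not stated in the abstract) choice would be to pool the embeddings of the hypertoken's constituent base tokens.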