Is There a Case for Conversation Optimized Tokenizers in Large Language Models?
June 23, 2025
Authors: Raquel Ferrando, Javier Conde, Gonzalo Martínez, Pedro Reviriego
cs.AI
Abstract
The computational and energy costs of Large Language Models (LLMs) have
increased exponentially, driven by growing model sizes and the massive
adoption of LLMs by hundreds of millions of users. The unit cost of an LLM is
the computation of a single token. Therefore, tokenizers play an important
role in the efficiency of a model, and they are carefully optimized to
minimize the number of tokens for the text in their training corpus. One of
the most popular applications of LLMs is chatbots that interact with users. A
key observation is that, for those chatbots, what matters is the performance
of the tokenizer on the user text input and the chatbot responses, which are
most likely different from the text in the training corpus. So, a question
that immediately arises is whether there is a potential benefit in optimizing
tokenizers for chatbot conversations. In this paper, this idea is explored for
different tokenizers by using a publicly available corpus of chatbot
conversations to redesign their vocabularies and evaluate their performance in
this domain. The results show that conversation-optimized tokenizers
consistently reduce the number of tokens in chatbot dialogues, which can lead
to meaningful energy savings in the range of 5% to 10%, while having minimal
or even a slightly positive impact on tokenization efficiency for the original
training corpus.
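
To make the experiment concrete, the sketch below shows one way to reproduce the core idea under stated assumptions: retrain a BPE vocabulary on a chat-conversation corpus with the Hugging Face `tokenizers` library, then compare token counts against a baseline tokenizer on held-out dialogues. This is not the authors' code; the file names (`chat_corpus.txt`, `chat_eval.txt`), the 32,000-entry vocabulary size, and the GPT-2 baseline are illustrative placeholders, not details from the paper.

```python
# Minimal sketch (not the authors' implementation): retrain a byte-level BPE
# vocabulary on chat conversations and measure the token-count reduction on
# held-out dialogues. File names, vocab size, and the GPT-2 baseline are
# assumptions for illustration only.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def train_chat_tokenizer(corpus_files, vocab_size=32_000):
    """Train a byte-level BPE tokenizer from scratch on chat text."""
    tokenizer = Tokenizer(models.BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    )
    tokenizer.train(corpus_files, trainer)
    return tokenizer

def token_count(tokenizer, texts):
    """Total number of tokens the tokenizer produces for a list of texts."""
    return sum(len(tokenizer.encode(t).ids) for t in texts)

if __name__ == "__main__":
    chat_tok = train_chat_tokenizer(["chat_corpus.txt"])  # chat training dump
    baseline = Tokenizer.from_pretrained("gpt2")          # reference tokenizer
    with open("chat_eval.txt", encoding="utf-8") as f:    # held-out dialogues
        eval_texts = f.read().splitlines()
    base = token_count(baseline, eval_texts)
    opt = token_count(chat_tok, eval_texts)
    print(f"baseline: {base} tokens, chat-optimized: {opt} tokens, "
          f"reduction: {100 * (base - opt) / base:.1f}%")
```

The link from token counts to energy follows because inference compute scales with the number of tokens processed, so a token reduction of roughly r on the conversational workload translates into energy savings of the same order, which is the basis of the 5% to 10% figure quoted in the abstract.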