Is There a Case for Conversation Optimized Tokenizers in Large Language Models?
June 23, 2025
Authors: Raquel Ferrando, Javier Conde, Gonzalo Martínez, Pedro Reviriego
cs.AI
Abstract
The computational and energy costs of Large Language Models (LLMs) have
increased exponentially, driven by growing model sizes and the massive
adoption of LLMs by hundreds of millions of users. The unit cost of an LLM is
the computation of a single token. Therefore, tokenizers play an important
role in the efficiency of a model, and they are carefully optimized to
minimize the number of tokens for the text in their training corpus. One of
the most popular applications of LLMs is chatbots that interact with users. A
key observation is that, for those chatbots, what matters is the performance
of the tokenizer on the user text input and the chatbot responses, which are
most likely different from the text in the training corpus. So, a question
that immediately arises is whether there is a potential benefit in optimizing
tokenizers for chatbot conversations. In this paper, this idea is explored for
different tokenizers by using a publicly available corpus of chatbot
conversations to redesign their vocabularies and evaluate their performance in
this domain. The results show that conversation-optimized tokenizers
consistently reduce the number of tokens in chatbot dialogues, which can lead
to meaningful energy savings in the range of 5% to 10%, while having minimal
or even a slightly positive impact on tokenization efficiency for the original
training corpus.
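
To make the experiment concrete, the sketch below shows one way to reproduce the core idea under stated assumptions: retrain a BPE vocabulary on a chat-conversation corpus with the Hugging Face `tokenizers` library, then compare token counts against a baseline tokenizer on held-out dialogues. This is not the authors' code; the file names (`chat_corpus.txt`, `chat_eval.txt`), the 32,000-entry vocabulary size, and the GPT-2 baseline are illustrative placeholders, not details from the paper.

```python
# Minimal sketch (not the authors' implementation): retrain a byte-level BPE
# vocabulary on chat conversations and measure the token-count reduction on
# held-out dialogues. File names, vocab size, and the GPT-2 baseline are
# assumptions for illustration only.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def train_chat_tokenizer(corpus_files, vocab_size=32_000):
    """Train a byte-level BPE tokenizer from scratch on chat text."""
    tokenizer = Tokenizer(models.BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    )
    tokenizer.train(corpus_files, trainer)
    return tokenizer

def token_count(tokenizer, texts):
    """Total number of tokens the tokenizer produces for a list of texts."""
    return sum(len(tokenizer.encode(t).ids) for t in texts)

if __name__ == "__main__":
    chat_tok = train_chat_tokenizer(["chat_corpus.txt"])  # chat training dump
    baseline = Tokenizer.from_pretrained("gpt2")          # reference tokenizer
    with open("chat_eval.txt", encoding="utf-8") as f:    # held-out dialogues
        eval_texts = f.read().splitlines()
    base = token_count(baseline, eval_texts)
    opt = token_count(chat_tok, eval_texts)
    print(f"baseline: {base} tokens, chat-optimized: {opt} tokens, "
          f"reduction: {100 * (base - opt) / base:.1f}%")
```

The link from token counts to energy follows because inference compute scales with the number of tokens processed, so a token reduction of roughly r on the conversational workload translates into energy savings of the same order, which is the basis of the 5% to 10% figure quoted in the abstract.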