Is There a Case for Conversation Optimized Tokenizers in Large Language Models?
June 23, 2025
Authors: Raquel Ferrando, Javier Conde, Gonzalo Martínez, Pedro Reviriego
cs.AI
Abstract
The computational and energy costs of Large Language Models (LLMs) have
increased exponentially, driven by growing model sizes and the massive
adoption of LLMs by hundreds of millions of users. The unit of cost in an LLM
is the computation of a token. The tokenizer therefore plays an important role
in the efficiency of a model, and tokenizers are carefully optimized to
minimize the number of tokens for the text in their training corpus. One of
the most popular applications of LLMs is chatbots that interact with users. A
key observation is that, for those chatbots, what matters is the performance
of the tokenizer on the user text input and the chatbot responses, which are
most likely different from the text in the training corpus. A question that
immediately arises, then, is whether there is a potential benefit in
optimizing tokenizers for chatbot conversations. In this paper, this idea is
explored for different tokenizers by using a publicly available corpus of
chatbot conversations to redesign their vocabularies and evaluate their
performance in this domain. The results show that conversation-optimized
tokenizers consistently reduce the number of tokens in chatbot dialogues,
which can lead to meaningful energy savings in the range of 5% to 10%, while
having minimal or even a slightly positive impact on tokenization efficiency
for the original training corpus.
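The core idea, learning a tokenizer's merge rules from the corpus it will actually serve, can be illustrated with a minimal sketch. The code below is a toy, simplified character-level BPE trainer in pure Python, not the paper's actual method or data: the two corpora, the merge budget, and all function names are hypothetical. It trains one set of merges on "generic" text and another on "conversational" text, then compares token counts on a chat-style evaluation string.

```python
# Toy sketch (assumption: simplified character-level BPE per word; the
# corpora and merge budget below are hypothetical, not the paper's setup).
from collections import Counter

def learn_merges(words, num_merges):
    """Greedily learn BPE merge rules from a list of words."""
    # Represent each word as a tuple of single-character symbols.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite the vocabulary with the chosen pair merged.
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

def tokenize(word, merges):
    """Apply learned merges to one word, in learned order."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

def count_tokens(text, merges):
    return sum(len(tokenize(w, merges)) for w in text.split())

# Hypothetical toy corpora: formal "training" text vs. chat-style text.
train_corpus = "the model computes the gradient of the loss function " * 20
chat_corpus = "hey can you help me please thanks lol " * 20

generic = learn_merges(train_corpus.split(), 50)
conversational = learn_merges(chat_corpus.split(), 50)

chat_eval = "hey can you help me thanks"
print("generic:", count_tokens(chat_eval, generic),
      "conversational:", count_tokens(chat_eval, conversational))
```

On this toy data the conversation-trained merges produce fewer tokens for the chat input than the generic ones, which mirrors the paper's observation at a much smaller scale; the real experiments redesign the vocabularies of production tokenizers using a public corpus of chatbot conversations.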