Is There a Case for Conversation Optimized Tokenizers in Large Language Models?
June 23, 2025
Authors: Raquel Ferrando, Javier Conde, Gonzalo Martínez, Pedro Reviriego
cs.AI
Abstract
The computational and energy costs of Large Language Models (LLMs) have
increased exponentially, driven by growing model sizes and the massive
adoption of LLMs by hundreds of millions of users. The unit of cost in an LLM
is the computation of a token. The tokenizer therefore plays an important role
in the efficiency of a model, and tokenizers are carefully optimized to
minimize the number of tokens for the text in their training corpus. One of
the most popular applications of LLMs is chatbots that interact with users. A
key observation is that, for those chatbots, what matters is the performance
of the tokenizer on the user text input and the chatbot responses, which are
most likely different from the text in the training corpus. A question that
immediately arises, then, is whether there is a potential benefit in
optimizing tokenizers for chatbot conversations. In this paper, this idea is
explored for different tokenizers by using a publicly available corpus of
chatbot conversations to redesign their vocabularies and evaluate their
performance in this domain. The results show that conversation-optimized
tokenizers consistently reduce the number of tokens in chatbot dialogues,
which can lead to meaningful energy savings in the range of 5% to 10%, while
having minimal or even a slightly positive impact on tokenization efficiency
for the original training corpus.
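The core idea, learning a tokenizer's merge rules from the corpus it will actually serve, can be illustrated with a minimal sketch. The code below is a toy, simplified character-level BPE trainer in pure Python, not the paper's actual method or data: the two corpora, the merge budget, and all function names are hypothetical. It trains one set of merges on "generic" text and another on "conversational" text, then compares token counts on a chat-style evaluation string.

```python
# Toy sketch (assumption: simplified character-level BPE per word; the
# corpora and merge budget below are hypothetical, not the paper's setup).
from collections import Counter

def learn_merges(words, num_merges):
    """Greedily learn BPE merge rules from a list of words."""
    # Represent each word as a tuple of single-character symbols.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite the vocabulary with the chosen pair merged.
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

def tokenize(word, merges):
    """Apply learned merges to one word, in learned order."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

def count_tokens(text, merges):
    return sum(len(tokenize(w, merges)) for w in text.split())

# Hypothetical toy corpora: formal "training" text vs. chat-style text.
train_corpus = "the model computes the gradient of the loss function " * 20
chat_corpus = "hey can you help me please thanks lol " * 20

generic = learn_merges(train_corpus.split(), 50)
conversational = learn_merges(chat_corpus.split(), 50)

chat_eval = "hey can you help me thanks"
print("generic:", count_tokens(chat_eval, generic),
      "conversational:", count_tokens(chat_eval, conversational))
```

On this toy data the conversation-trained merges produce fewer tokens for the chat input than the generic ones, which mirrors the paper's observation at a much smaller scale; the real experiments redesign the vocabularies of production tokenizers using a public corpus of chatbot conversations.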