Is There a Case for Conversation Optimized Tokenizers in Large Language Models?

Abstract

The computational and energy costs of Large Language Models (LLMs) have increased exponentially, driven by growing model sizes and the massive adoption of LLMs by hundreds of millions of users. The unit cost of an LLM is the computation of a token. The tokenizer therefore plays an important role in the efficiency of a model, and tokenizers are carefully optimized to minimize the number of tokens needed for the text in their training corpus. One of the most popular applications of LLMs is chatbots that interact with users. A key observation is that, for those chatbots, what matters is the performance of the tokenizer on the user's text input and on the chatbot's responses, and these texts most likely differ from the text in the training corpus. A question that immediately arises is whether there is a potential benefit in optimizing tokenizers for chatbot conversations. In this paper, this idea is explored for different tokenizers by using a publicly available corpus of chatbot conversations to redesign their vocabularies and evaluate their performance in this domain. The results show that conversation-optimized tokenizers consistently reduce the number of tokens in chatbot dialogues, which can lead to meaningful energy savings in the range of 5% to 10%, while having a minimal or even slightly positive impact on tokenization efficiency for the original training corpus.
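The comparison described above can be illustrated with a small sketch. This is not the paper's pipeline: it is a toy byte-pair-encoding (BPE) trainer in plain Python, and the function names (`train_bpe`, `apply_merge`, `count_tokens`, `token_reduction`) and the example strings are hypothetical. It shows only the general idea of learning a vocabulary from one corpus, counting the tokens needed to encode a conversation, and expressing the difference between two vocabularies as a percentage.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of strings (toy BPE)."""
    # Words are split on whitespace and start as tuples of single characters.
    words = Counter(tuple(w) for line in corpus for w in line.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[apply_merge(symbols, best)] += freq
        words = merged_words
    return merges


def count_tokens(text, merges):
    """Number of tokens needed to encode `text` with the learned merges."""
    total = 0
    for word in text.split():
        symbols = tuple(word)
        for pair in merges:  # apply merges in the order they were learned
            symbols = apply_merge(symbols, pair)
        total += len(symbols)
    return total


def token_reduction(baseline_tokens, optimized_tokens):
    """Relative token reduction, in percent, of the optimized tokenizer."""
    return 100.0 * (baseline_tokens - optimized_tokens) / baseline_tokens


# Hypothetical toy data: one "training-style" corpus and one "chat-style" corpus.
generic_corpus = ["the committee approved the annual report"]
chat_corpus = ["can you help me please", "sure, can you share the error please"]
dialogue = "can you help me with the error please"

baseline = count_tokens(dialogue, train_bpe(generic_corpus, 20))
optimized = count_tokens(dialogue, train_bpe(chat_corpus, 20))
print(baseline, optimized)
```

With real tokenizers, the same comparison would be run on held-out conversations and also on the original training corpus, since the abstract reports the effect on both.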
