Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs

Abstract

LLMs process text as sequences of tokens that roughly correspond to words, where less common words are represented by multiple tokens. However, individual tokens are often semantically unrelated to the meanings of the words or concepts they make up. For example, Llama-2-7b's tokenizer splits the word "northeastern" into the tokens ['_n', 'ort', 'he', 'astern'], none of which corresponds to a semantically meaningful unit like "north" or "east." Similarly, the overall meanings of named entities like "Neil Young" and multi-word expressions like "break a leg" cannot be directly inferred from their constituent tokens. Mechanistically, how do LLMs convert such arbitrary groups of tokens into useful higher-level representations? In this work, we find that last-token representations of named entities and multi-token words exhibit a pronounced "erasure" effect, where information about previous and current tokens is rapidly forgotten in early layers. Using this observation, we propose a method to "read out" the implicit vocabulary of an autoregressive LLM by examining differences in token representations across layers, and present results of this method for Llama-2-7b and Llama-3-8B. To our knowledge, this is the first attempt to probe the implicit vocabulary of an LLM.
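As a rough illustration of comparing token-level information across layers, the sketch below scores a candidate token span by how much a probe's confidence in recovering the span's tokens drops between an early layer and a later one. The probe probabilities are made-up numbers, and the function `erasure_drop`, the span names, and the scoring rule are assumptions for illustration only; this is not the scoring function used in this work.

```python
from typing import Sequence


def erasure_drop(early: Sequence[float], late: Sequence[float]) -> float:
    """Mean drop in probe probability from an early layer to a later layer.

    early[i] and late[i] are hypothetical probabilities that a probe reading
    the hidden state at the span's last position recovers the i-th token of
    the span. A large positive value means the token identities became
    harder to read out, i.e. they look "erased".
    """
    if len(early) != len(late) or not early:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(e - l for e, l in zip(early, late)) / len(early)


# Made-up probe outputs, for illustration only (not measured values).
candidates = {
    "multi_token_word": ([0.9, 0.8], [0.2, 0.1]),  # large drop
    "unrelated_pair": ([0.9, 0.8], [0.8, 0.8]),    # small drop
}

# Score every candidate span and pick the one with the largest drop.
scores = {span: erasure_drop(e, l) for span, (e, l) in candidates.items()}
best = max(scores, key=scores.get)
```

Under these toy numbers, the span whose token identities fade the most between layers is ranked first, which is the intuition behind treating erasure as a signal that a group of tokens is handled as a single vocabulary item.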
