Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs
June 28, 2024
Authors: Sheridan Feucht, David Atkinson, Byron Wallace, David Bau
cs.AI
Abstract
LLMs process text as sequences of tokens that roughly correspond to words,
where less common words are represented by multiple tokens. However, individual
tokens are often semantically unrelated to the meanings of the words/concepts
they comprise. For example, Llama-2-7b's tokenizer splits the word
"northeastern" into the tokens ['_n', 'ort', 'he', 'astern'], none of which
correspond to semantically meaningful units like "north" or "east." Similarly,
the overall meanings of named entities like "Neil Young" and multi-word
expressions like "break a leg" cannot be directly inferred from their
constituent tokens. Mechanistically, how do LLMs convert such arbitrary groups
of tokens into useful higher-level representations? In this work, we find that
last token representations of named entities and multi-token words exhibit a
pronounced "erasure" effect, where information about previous and current
tokens is rapidly forgotten in early layers. Using this observation, we propose
a method to "read out" the implicit vocabulary of an autoregressive LLM by
examining differences in token representations across layers, and present
results of this method for Llama-2-7b and Llama-3-8B. To our knowledge, this is
the first attempt to probe the implicit vocabulary of an LLM.
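The two technical points in the abstract, arbitrary subword fragmentation and layer-wise comparison of last-token representations, can be illustrated with a small sketch. The snippet below is not the authors' released method: it only tokenizes "northeastern" with the Llama-2-7b tokenizer and then tracks, as a rough proxy for representation change across layers, the cosine similarity between the last subword's hidden state at each layer and its initial embedding. The checkpoint name "meta-llama/Llama-2-7b-hf", the prompt, and the cosine-similarity probe are assumptions for illustration only.

```python
# Minimal sketch (assumed setup, not the paper's code): inspect how Llama-2-7b
# fragments a word and how the last subword's representation drifts across layers.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint; gated, needs access approval
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Loading the full 7B model requires substantial memory; output_hidden_states exposes
# per-layer activations on the forward pass.
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

# (1) Tokenization: "northeastern" is split into pieces that carry no semantic
# relation to "north" or "east".
print(tokenizer.tokenize("northeastern"))  # e.g. ['_n', 'ort', 'he', 'astern']

# (2) Layer-wise drift of the last subword's representation. The prompt ends with
# the multi-token word, so its final subword sits at the last sequence position.
text = "the northeastern"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states  # tuple of (num_layers + 1) x [1, seq, d]

base = hidden_states[0][0, -1]  # embedding-layer state of the last subword
for layer, h in enumerate(hidden_states):
    sim = torch.cosine_similarity(h[0, -1], base, dim=0).item()
    print(f"layer {layer:2d}: cosine similarity to embedding = {sim:.3f}")
```

A sharp early-layer drop in this similarity for the last subword of a multi-token word would be consistent with the "erasure" effect described above, though the paper's actual read-out procedure compares token representations across layers rather than using this particular probe.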