Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs
June 28, 2024
Authors: Sheridan Feucht, David Atkinson, Byron Wallace, David Bau
cs.AI
Abstract
LLMs process text as sequences of tokens that roughly correspond to words,
where less common words are represented by multiple tokens. However, individual
tokens are often semantically unrelated to the meanings of the words/concepts
they comprise. For example, Llama-2-7b's tokenizer splits the word
"northeastern" into the tokens ['_n', 'ort', 'he', 'astern'], none of which
correspond to semantically meaningful units like "north" or "east." Similarly,
the overall meanings of named entities like "Neil Young" and multi-word
expressions like "break a leg" cannot be directly inferred from their
constituent tokens. Mechanistically, how do LLMs convert such arbitrary groups
of tokens into useful higher-level representations? In this work, we find that
last token representations of named entities and multi-token words exhibit a
pronounced "erasure" effect, where information about previous and current
tokens is rapidly forgotten in early layers. Using this observation, we propose
a method to "read out" the implicit vocabulary of an autoregressive LLM by
examining differences in token representations across layers, and present
results of this method for Llama-2-7b and Llama-3-8B. To our knowledge, this is
the first attempt to probe the implicit vocabulary of an LLM.
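
The tokenization behavior the abstract describes is easy to check directly. Below is a minimal sketch using the Hugging Face `transformers` library; it assumes access to the gated `meta-llama/Llama-2-7b-hf` checkpoint (any locally cached copy of the same tokenizer also works), and the exact expected output is taken from the abstract rather than verified here:

```python
from transformers import AutoTokenizer

# Assumption: access to the gated Llama-2 checkpoint on the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Split a single word into subword tokens.
tokens = tokenizer.tokenize("northeastern")
print(tokens)
# Expected, per the abstract: ['_n', 'ort', 'he', 'astern']
# (the leading '_' is SentencePiece's marker for a word boundary).
# None of these pieces correspond to "north" or "east".
```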
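The abstract does not specify the exact metric used to compare token representations across layers, so the following is only a crude illustrative proxy for the "erasure" effect, not the authors' method: it tracks how quickly the hidden state at the last token of a multi-token entity ("Neil Young") drifts away from its embedding-layer state in early layers, using the standard `transformers` forward pass. The model name and the cosine-similarity probe are assumptions for illustration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # assumption: any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16)
model.eval()

text = "I grew up listening to Neil Young"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple of (num_layers + 1) tensors, each of shape
# [batch, seq_len, hidden_dim]; index 0 is the embedding-layer output.
pos = inputs["input_ids"].shape[1] - 1  # last token of "Neil Young"
base = out.hidden_states[0][0, pos].float()

# Cosine similarity to the embedding-layer state at the same position.
# A sharp drop in early layers is (one crude signature of) the rapid
# forgetting of current-token information that the abstract calls erasure.
for layer, h in enumerate(out.hidden_states[1:], start=1):
    sim = torch.cosine_similarity(base, h[0, pos].float(), dim=0)
    print(f"layer {layer:2d}: cos sim to embedding = {sim.item():.3f}")
```

Comparing this trace at the last token of a named entity against the trace at an ordinary single-token word would be the natural next step; the paper's actual read-out procedure aggregates such cross-layer differences to recover implicit vocabulary items.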