Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs

Abstract

LLMs process text as sequences of tokens that roughly correspond to words, where less common words are represented by multiple tokens. However, individual tokens are often semantically unrelated to the meanings of the words/concepts they comprise. For example, Llama-2-7b's tokenizer splits the word "northeastern" into the tokens ['_n', 'ort', 'he', 'astern'], none of which correspond to semantically meaningful units like "north" or "east." Similarly, the overall meanings of named entities like "Neil Young" and multi-word expressions like "break a leg" cannot be directly inferred from their constituent tokens. Mechanistically, how do LLMs convert such arbitrary groups of tokens into useful higher-level representations? In this work, we find that last token representations of named entities and multi-token words exhibit a pronounced "erasure" effect, where information about previous and current tokens is rapidly forgotten in early layers. Using this observation, we propose a method to "read out" the implicit vocabulary of an autoregressive LLM by examining differences in token representations across layers, and present results of this method for Llama-2-7b and Llama-3-8B. To our knowledge, this is the first attempt to probe the implicit vocabulary of an LLM.
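The "northeastern" example can be checked mechanically. The following is a minimal sketch, not part of the method itself: it takes the token split quoted above as given, treats the leading `_` as a word-start marker, and compares the token boundaries against an assumed morpheme segmentation (`north` + `east` + `ern`, chosen here purely for illustration). The helper name `interior_boundaries` is hypothetical.

```python
from itertools import accumulate


def interior_boundaries(pieces):
    """Return the character offsets where one piece ends and the next begins."""
    ends = list(accumulate(len(p) for p in pieces))
    return set(ends[:-1])  # drop the final offset, which is just the word length


# Token split quoted in the abstract; "_" is treated as a word-start marker.
tokens = ["_n", "ort", "he", "astern"]
pieces = [t.lstrip("_") for t in tokens]
word = "".join(pieces)

# Assumed morpheme segmentation, used only for illustration.
morphemes = ["north", "east", "ern"]
assert "".join(morphemes) == word

token_cuts = interior_boundaries(pieces)
morpheme_cuts = interior_boundaries(morphemes)

# Cut points that the tokenizer and the morpheme segmentation have in common.
shared_cuts = token_cuts & morpheme_cuts

print(word, sorted(token_cuts), sorted(morpheme_cuts), sorted(shared_cuts))
```

Under these assumptions the two segmentations share no internal cut point, which is the sense in which the tokens do not line up with units like "north" or "east".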
