Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs

Abstract

LLMs process text as sequences of tokens that roughly correspond to words, where less common words are represented by multiple tokens. However, individual tokens are often semantically unrelated to the meanings of the words or concepts they make up. For example, Llama-2-7b's tokenizer splits the word "northeastern" into the tokens ['_n', 'ort', 'he', 'astern'], none of which corresponds to a semantically meaningful unit like "north" or "east." Similarly, the overall meanings of named entities like "Neil Young" and multi-word expressions like "break a leg" cannot be directly inferred from their constituent tokens. Mechanistically, how do LLMs convert such arbitrary groups of tokens into useful higher-level representations? In this work, we find that last-token representations of named entities and multi-token words exhibit a pronounced "erasure" effect, where information about previous and current tokens is rapidly forgotten in early layers. Using this observation, we propose a method to "read out" the implicit vocabulary of an autoregressive LLM by examining differences in token representations across layers, and present results of this method for Llama-2-7b and Llama-3-8B. To our knowledge, this is the first attempt to probe the implicit vocabulary of an LLM.
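As a small illustration of the tokenization example above, the following sketch checks that the quoted sub-word pieces spell "northeastern" while none of them matches a morpheme a reader would expect. The pieces are hard-coded from the abstract rather than produced by running the Llama-2-7b tokenizer, and the `morphemes` list is an assumption chosen only for illustration.

```python
# Sub-word pieces for "northeastern" as quoted in the abstract (hard-coded;
# this sketch does not load or call the actual Llama-2-7b tokenizer).
pieces = ["_n", "ort", "he", "astern"]

# In this notation a leading "_" marks a word boundary (a preceding space).
word = "".join(pieces).replace("_", " ").strip()

# Morphemes a human reader might expect; an illustrative assumption.
morphemes = ["north", "east", "ern"]

# Strip the boundary marker, then look for pieces that are also morphemes.
normalized = [p.lstrip("_") for p in pieces]
overlap = [p for p in normalized if p in morphemes]

print(word)     # northeastern
print(overlap)  # []
```

The empty overlap is the point of the example: the pieces reconstruct the surface string, but any word-level meaning has to be assembled by the model from pieces that do not carry it individually.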
