Tokenization Falling Short: The Curse of Tokenization

Abstract

Language models typically tokenize raw text into sequences of subword identifiers from a predefined vocabulary. This process is inherently sensitive to typographical errors and length variations, and it is largely oblivious to the internal structure of tokens. We term these issues the curse of tokenization. In this study, we examine these drawbacks in depth and demonstrate that large language models (LLMs) remain susceptible to them. We systematically investigate these challenges and their impact on LLMs through three research questions: (1) complex problem solving, (2) token structure probing, and (3) resilience to typographical variation. Our findings reveal that scaling model parameters can mitigate tokenization issues; however, LLMs still suffer from biases induced by typos and other text format variations. Our experiments show that subword regularization such as BPE-dropout can mitigate this issue. We will release our code and data to facilitate further research.
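To make the typo sensitivity and the idea behind BPE-dropout concrete, here is a minimal sketch of a toy BPE encoder. The merge table `MERGES` and the function `bpe_encode` are hypothetical illustrations, not the tokenizers or code used in the paper. The sketch assumes the usual formulation of BPE-dropout, in which each candidate merge is skipped with probability `dropout` at every merge step.

```python
import random

# Hypothetical toy merge table, in priority order (earlier = higher priority).
MERGES = [("t", "o"), ("to", "k"), ("e", "n"), ("tok", "en")]


def bpe_encode(word, merges, dropout=0.0, rng=None):
    """Greedy BPE over characters; each candidate merge is dropped
    with probability `dropout` at every step (BPE-dropout style)."""
    rng = rng or random.Random()
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Adjacent pairs that are in the merge table and survive dropout.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks and not (dropout > 0 and rng.random() < dropout)
        ]
        if not candidates:
            break
        _, i = min(candidates)  # apply the highest-priority surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


clean = bpe_encode("token", MERGES)   # one token for the clean word
typo = bpe_encode("tokne", MERGES)    # a single swap fragments the word
sampled = bpe_encode("token", MERGES, dropout=0.3, rng=random.Random(0))
```

In this toy setting, swapping two characters changes the segmentation from a single token to several fragments, so the model sees an unrelated identifier sequence. With `dropout > 0`, the same clean word is segmented in varying ways during training, which is one plausible reason subword regularization helps with such variation.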
