FLEXITOKENS: Flexible Tokenization for Evolving Language Models

Abstract

Language models (LMs) are challenging to adapt to new data distributions by simple finetuning. This is due to the rigidity of their subword tokenizers, which typically remain unchanged during adaptation. This inflexibility often leads to inefficient tokenization, causing over-fragmentation of text from out-of-distribution domains, unseen languages, or unseen scripts. In this work, we develop byte-level LMs with learnable tokenizers to make tokenization adaptive. Our models include a submodule that learns to predict boundaries within the input byte sequence, encoding it into variable-length segments. Existing tokenizer-free methods train this boundary predictor using an auxiliary loss that enforces a fixed compression rate across the training corpus, introducing a new kind of rigidity. We propose FLEXITOKENS, a simplified training objective that enables significantly greater flexibility during adaptation. Evaluating across multiple multilingual benchmarks, morphologically diverse tasks, and domains, we demonstrate that FLEXITOKENS consistently reduces token over-fragmentation and achieves up to 10% improvements on downstream task performance compared to subword and other gradient-based tokenizers. Code and data for our experiments will be released at https://github.com/owos/flexitokens
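To make the idea of boundary prediction and compression rate concrete, the following is a minimal illustrative sketch, not the released implementation. It assumes a boundary predictor has already produced one probability per input byte; the helper names `segment_bytes` and `compression_rate`, the 0.5 threshold, and the probability values are all hypothetical and chosen only for illustration.

```python
from typing import List


def segment_bytes(data: bytes, boundary_probs: List[float],
                  threshold: float = 0.5) -> List[bytes]:
    """Split `data` after every byte whose boundary probability exceeds `threshold`.

    Hypothetical helper: in a learned tokenizer the probabilities would come
    from a trained boundary predictor, here they are supplied by hand.
    """
    if len(data) != len(boundary_probs):
        raise ValueError("need exactly one boundary probability per byte")
    segments = []
    start = 0
    for i, p in enumerate(boundary_probs):
        if p > threshold:
            # Close the current segment at byte i (inclusive).
            segments.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        # Trailing bytes with no predicted boundary form the last segment.
        segments.append(data[start:])
    return segments


def compression_rate(data: bytes, segments: List[bytes]) -> float:
    """Average number of bytes per segment (0.0 for an empty segmentation)."""
    return len(data) / len(segments) if segments else 0.0


text = "flexible tokens".encode("utf-8")
probs = [0.1] * len(text)   # made-up predictor outputs
probs[7] = 0.9              # boundary after "flexible"
probs[8] = 0.8              # boundary after the space
segments = segment_bytes(text, probs)
print(segments, compression_rate(text, segments))
```

In this toy setting, a loss that forces the same compression rate on every input would push every language and domain toward the same average segment length, which is the rigidity the abstract describes.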
