Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning

Abstract

Pretrained large language models (LLMs) are often constrained by their fixed tokenization schemes, leading to inefficiencies and performance limitations, particularly for multilingual or specialized applications. This tokenizer lock-in is hard to escape: standard methods for overcoming it often require prohibitive computational resources. Although tokenizer replacement with heuristic initialization aims to reduce this burden, existing methods often require exhaustive residual fine-tuning and still may not fully preserve semantic nuances or adequately address the underlying compression inefficiencies. Our framework introduces two innovations: first, TokenAdapt, a model-agnostic tokenizer transplantation method, and second, novel pre-tokenization learning for multi-word Supertokens to enhance compression and reduce fragmentation. TokenAdapt initializes new unique token embeddings via a hybrid heuristic that combines two estimates: a local estimate based on subword decomposition using the old tokenizer, and a global estimate utilizing the top-k semantically similar tokens from the original vocabulary. This methodology aims to preserve semantics while significantly reducing retraining requirements. Empirical investigations validate both contributions: the transplantation heuristic successfully initializes unique tokens, markedly outperforming conventional baselines and sophisticated methods including TransTokenizer and ReTok, while our Supertokens achieve notable compression gains. Our zero-shot perplexity results demonstrate that the TokenAdapt hybrid initialization consistently yields lower perplexity ratios than both the ReTok and TransTokenizer baselines across different base models and newly trained target tokenizers. TokenAdapt typically reduced the overall perplexity ratio significantly compared to ReTok, yielding at least a 2-fold improvement in these aggregate scores.
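
The hybrid initialization described above can be illustrated with a minimal sketch. The abstract does not state how the local and global estimates are weighted or combined, so everything below is an assumption made for illustration: the local estimate is taken as a plain mean over the old tokenizer's subword pieces, the global estimate as a softmax-weighted mean over the top-k most cosine-similar old tokens in a hypothetical auxiliary embedding space (`aux_vec`), and the two are mixed with a hypothetical weight `w_glob`. The names `old_emb`, `old_tokenize`, `temperature`, and `hybrid_init` are likewise illustrative and are not taken from the paper.

```python
import math


def _mean(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def _cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def local_estimate(new_token, old_tokenize, old_emb):
    """Local estimate: split the new token with the OLD tokenizer and
    average the old-model embeddings of the resulting pieces.
    (Uniform averaging is an assumption, not the paper's stated rule.)"""
    pieces = [p for p in old_tokenize(new_token) if p in old_emb]
    return _mean([old_emb[p] for p in pieces]) if pieces else None


def global_estimate(new_token, aux_vec, old_emb, k=2, temperature=1.0):
    """Global estimate: find the top-k old-vocabulary tokens most similar
    to the new token in an auxiliary space, then take a softmax-weighted
    average of their old-model embeddings. (Weighting is an assumption.)"""
    if new_token not in aux_vec:
        return None
    query = aux_vec[new_token]
    scored = sorted(
        ((_cosine(query, aux_vec[t]), t)
         for t in old_emb if t in aux_vec and t != new_token),
        reverse=True,
    )[:k]
    if not scored:
        return None
    weights = [math.exp(score / temperature) for score, _ in scored]
    total = sum(weights)
    dim = len(old_emb[scored[0][1]])
    return [
        sum(w * old_emb[t][i] for w, (_, t) in zip(weights, scored)) / total
        for i in range(dim)
    ]


def hybrid_init(new_token, old_tokenize, old_emb, aux_vec, w_glob=0.3, k=2):
    """Mix the two estimates; fall back to whichever one is available.
    The convex combination and w_glob are hypothetical choices."""
    loc = local_estimate(new_token, old_tokenize, old_emb)
    glob = global_estimate(new_token, aux_vec, old_emb, k=k)
    if loc is None:
        return glob
    if glob is None:
        return loc
    return [(1 - w_glob) * l + w_glob * g for l, g in zip(loc, glob)]


# Toy usage: "tokenizer" is new; the old vocabulary has three tokens.
toy_old_emb = {"token": [1.0, 0.0], "izer": [0.0, 1.0], "word": [1.0, 1.0]}
toy_aux_vec = dict(toy_old_emb, tokenizer=[1.0, 1.0])
toy_split = lambda s: ["token", "izer"] if s == "tokenizer" else [s]
toy_vector = hybrid_init("tokenizer", toy_split, toy_old_emb, toy_aux_vec,
                         w_glob=0.5, k=1)
```

In this toy setup the local estimate averages the embeddings of "token" and "izer", while the global estimate copies the embedding of the single nearest neighbour "word"; the mixed vector lies halfway between them. A real transplantation would read `old_emb` from the model's embedding matrix and write the resulting vectors into a resized matrix for the new vocabulary.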
