FLEXITOKENS: Flexible Tokenization for Evolving Language Models
July 17, 2025
Authors: Abraham Toluase Owodunni, Orevaoghene Ahia, Sachin Kumar
cs.AI
Abstract
Language models (LMs) are challenging to adapt to new data distributions by
simple finetuning. This is due to the rigidity of their subword tokenizers,
which typically remain unchanged during adaptation. This inflexibility often
leads to inefficient tokenization, causing over-fragmentation of
out-of-distribution domains, unseen languages, or scripts. In this work, we
develop byte-level LMs with learnable tokenizers to make tokenization adaptive.
Our models include a submodule that learns to predict boundaries within the
input byte sequence, encoding it into variable-length segments. Existing
tokenizer-free methods train this boundary predictor using an auxiliary loss
that enforces a fixed compression rate across the training corpus, introducing
a new kind of rigidity. We propose FLEXITOKENS, a simplified training objective
that enables significantly greater flexibility during adaptation. Evaluating
across multiple multilingual benchmarks, morphologically diverse tasks, and
domains, we demonstrate that FLEXITOKENS consistently reduces token
over-fragmentation and achieves up to 10% improvements on downstream task
performance compared to subword and other gradient-based tokenizers. Code and
data for our experiments will be released at
https://github.com/owos/flexitokens.
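
To make the contrast between a fixed-compression-rate auxiliary loss and a more flexible objective concrete, below is a minimal PyTorch sketch of a byte-level boundary predictor. All module names, hyperparameters, and the relaxed loss shown here are illustrative assumptions for exposition only, not the released FLEXITOKENS implementation; see the repository above for the actual objective.

# Illustrative sketch: a learnable byte-level boundary predictor.
# All names and hyperparameters below are assumptions, not the FLEXITOKENS code.
import torch
import torch.nn as nn

class BoundaryPredictor(nn.Module):
    """Scores each byte position; positions above a threshold start a new segment."""
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.byte_embed = nn.Embedding(256, d_model)   # one embedding per byte value
        self.scorer = nn.Linear(d_model, 1)            # per-position boundary logit

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        h = self.byte_embed(byte_ids)                  # (batch, seq_len, d_model)
        return self.scorer(h).squeeze(-1)              # (batch, seq_len) boundary logits

def fixed_rate_aux_loss(logits: torch.Tensor, target_rate: float = 0.25) -> torch.Tensor:
    """Prior tokenizer-free approach: pin the expected boundary rate to a single target."""
    p = torch.sigmoid(logits)
    return (p.mean() - target_rate).pow(2)

def relaxed_aux_loss(logits: torch.Tensor, min_rate: float = 0.1) -> torch.Tensor:
    """One plausible relaxation: only penalize under-segmentation below a floor,
    leaving the model free to place more boundaries where the data needs them.
    (Illustrative assumption; the actual FLEXITOKENS objective may differ.)"""
    p = torch.sigmoid(logits)
    return torch.relu(min_rate - p.mean())

# Usage: score a batch of raw bytes and threshold the logits into segments.
byte_ids = torch.randint(0, 256, (2, 128))
predictor = BoundaryPredictor()
logits = predictor(byte_ids)
boundaries = (logits > 0).long()                       # 1 marks the start of a new segment

Thresholding the logits yields variable-length byte segments whose density can shift with the data distribution, which is the kind of adaptivity the abstract describes.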