

FLEXITOKENS: Flexible Tokenization for Evolving Language Models

July 17, 2025
Authors: Abraham Toluase Owodunni, Orevaoghene Ahia, Sachin Kumar
cs.AI

Abstract

Language models (LMs) are challenging to adapt to new data distributions by simple finetuning. This is due to the rigidity of their subword tokenizers, which typically remain unchanged during adaptation. This inflexibility often leads to inefficient tokenization, causing over-fragmentation of out-of-distribution domains, unseen languages, or scripts. In this work, we develop byte-level LMs with learnable tokenizers to make tokenization adaptive. Our models include a submodule that learns to predict boundaries within the input byte sequence, encoding it into variable-length segments. Existing tokenizer-free methods train this boundary predictor using an auxiliary loss that enforces a fixed compression rate across the training corpus, introducing a new kind of rigidity. We propose FLEXITOKENS, a simplified training objective that enables significantly greater flexibility during adaptation. Evaluating across multiple multilingual benchmarks, morphologically diverse tasks, and domains, we demonstrate that FLEXITOKENS consistently reduces token over-fragmentation and achieves up to 10% improvements on downstream task performance compared to subword and other gradient-based tokenizers. Code and data for our experiments will be released at https://github.com/owos/flexitokens.
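To make the boundary-prediction idea concrete, below is a minimal, hypothetical PyTorch sketch of a byte-level boundary predictor that scores each byte position and pools bytes into variable-length segments. The module name `ByteBoundaryPredictor`, the fixed `threshold`, and the mean-pooling step are illustrative assumptions, not the paper's architecture, and the training objective (the fixed-compression-rate auxiliary loss that FLEXITOKENS replaces) is omitted; see the linked repository for the actual implementation.

```python
# Hypothetical sketch (not the released FLEXITOKENS code): score each byte
# position with a learned boundary probability, then mean-pool bytes between
# predicted boundaries into variable-length segment representations.
import torch
import torch.nn as nn


class ByteBoundaryPredictor(nn.Module):
    """Positions whose boundary probability exceeds `threshold` start a new segment."""

    def __init__(self, d_model: int = 256, vocab_size: int = 256, threshold: float = 0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # one embedding per byte value
        self.scorer = nn.Linear(d_model, 1)              # per-position boundary logit
        self.threshold = threshold

    def forward(self, byte_ids: torch.Tensor):
        # byte_ids: (seq_len,) tensor with values in [0, 255]
        h = self.embed(byte_ids)                                   # (seq_len, d_model)
        probs = torch.sigmoid(self.scorer(h)).squeeze(-1)          # (seq_len,) boundary probabilities
        is_boundary = probs > self.threshold
        is_boundary[0] = True                                      # first byte always opens a segment

        # Assign each byte to its segment and mean-pool within segments.
        segment_ids = torch.cumsum(is_boundary.long(), dim=0) - 1  # (seq_len,) segment index per byte
        num_segments = int(segment_ids.max().item()) + 1
        segments = torch.zeros(num_segments, h.size(-1))
        counts = torch.zeros(num_segments, 1)
        segments.index_add_(0, segment_ids, h)
        counts.index_add_(0, segment_ids, torch.ones(len(byte_ids), 1))
        return segments / counts, probs


if __name__ == "__main__":
    byte_ids = torch.tensor(list("adaptive tokenization".encode("utf-8")))
    segments, probs = ByteBoundaryPredictor()(byte_ids)
    print(segments.shape)  # (num_segments, d_model); num_segments varies with the predicted boundaries
```

Note that hard thresholding is not differentiable on its own, which is why gradient-based tokenizers train the predictor through an auxiliary objective; the paper's contribution is replacing the fixed-compression-rate version of that objective with a more flexible one.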