Bolmo: Byteifying the Next Generation of Language Models
December 17, 2025
Authors: Benjamin Minixhofer, Tyler Murray, Tomasz Limisiewicz, Anna Korhonen, Luke Zettlemoyer, Noah A. Smith, Edoardo M. Ponti, Luca Soldaini, Valentin Hofmann
cs.AI
Abstract
We introduce Bolmo, the first family of competitive fully open byte-level language models (LMs) at the 1B and 7B parameter scales. In contrast to prior research on byte-level LMs, which focuses predominantly on training from scratch, we train Bolmo by byteifying existing subword-level LMs. Byteification enables overcoming the limitations of subword tokenization - such as insufficient character understanding and efficiency constraints due to the fixed subword vocabulary - while performing at the level of leading subword-level LMs. Bolmo is specifically designed for byteification: our architecture resolves a mismatch between the expressivity of prior byte-level architectures and subword-level LMs, which makes it possible to employ an effective exact distillation objective between Bolmo and the source subword model. This allows for converting a subword-level LM to a byte-level LM by investing less than 1% of a typical pretraining token budget. Bolmo substantially outperforms all prior byte-level LMs of comparable size, and outperforms the source subword-level LMs on character understanding and, in some cases, coding, while coming close to matching the original LMs' performance on other tasks. Furthermore, we show that Bolmo can achieve inference speeds competitive with subword-level LMs by training with higher token compression ratios, and can be cheaply and effectively post-trained by leveraging the existing ecosystem around the source subword-level LM. Our results finally make byte-level LMs a practical choice competitive with subword-level LMs across a wide set of use cases.
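The abstract refers to an exact distillation objective between the byte-level model and the source subword model. As a rough illustration of the general idea only (not the paper's actual formulation or architecture), the sketch below scores each subword's byte expansion under a toy byte-level student, renormalizes over a small candidate vocabulary, and matches the result to a teacher's next-token distribution with a KL loss. Every name here (TinyByteLM, subword_logprobs_from_bytes, the toy vocabulary and stand-in teacher) is hypothetical.

```python
# Minimal, hypothetical sketch of distilling a subword teacher's next-token
# distribution into a byte-level student. This is NOT Bolmo's actual objective;
# it only illustrates matching byte-level and subword-level probabilities.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyByteLM(nn.Module):
    """Stand-in byte-level student: embeds bytes, runs a GRU, predicts the next byte."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(256, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 256)

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        # byte_ids: (1, T) -> logits over the next byte at every position: (1, T, 256)
        h, _ = self.rnn(self.embed(byte_ids))
        return self.head(h)


def subword_logprobs_from_bytes(student: TinyByteLM,
                                prefix: list[int],
                                vocab_bytes: list[list[int]]) -> torch.Tensor:
    """Score each candidate subword as the sum of its bytes' log-probs under the
    student, conditioned on the prefix, then renormalize over the candidate set."""
    scores = []
    for cand in vocab_bytes:
        ids = torch.tensor([prefix + cand])            # (1, len(prefix) + len(cand))
        logp = F.log_softmax(student(ids), dim=-1)     # per-position next-byte log-probs
        total = torch.zeros(())
        for i, b in enumerate(cand):
            # log-prob of byte b given prefix + cand[:i]
            total = total + logp[0, len(prefix) + i - 1, b]
        scores.append(total)
    return F.log_softmax(torch.stack(scores), dim=-1)  # distribution over candidates


# Toy usage: four "subwords" with known byte expansions and a fake teacher distribution.
vocab_bytes = [list(" the".encode()), list(" cat".encode()),
               list(" dog".encode()), list(".".encode())]
teacher_logp = F.log_softmax(torch.randn(len(vocab_bytes)), dim=-1)  # stand-in teacher

student = TinyByteLM()
prefix = list("I saw".encode())
student_logp = subword_logprobs_from_bytes(student, prefix, vocab_bytes)

# Forward KL from teacher to student at this subword boundary; summing this over
# all boundaries in a sequence would give one plausible distillation training loss.
loss = F.kl_div(student_logp, teacher_logp, log_target=True, reduction="sum")
loss.backward()
print(float(loss))
```

A real implementation would need to score the full subword vocabulary efficiently (for example, by sharing prefix computation across candidates) rather than running the student once per candidate as this toy does, and would use the actual source subword model as the teacher.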