
Bolmo: Byteifying the Next Generation of Language Models

December 17, 2025
Authors: Benjamin Minixhofer, Tyler Murray, Tomasz Limisiewicz, Anna Korhonen, Luke Zettlemoyer, Noah A. Smith, Edoardo M. Ponti, Luca Soldaini, Valentin Hofmann
cs.AI

Abstract

We introduce Bolmo, the first family of competitive fully open byte-level language models (LMs) at the 1B and 7B parameter scales. In contrast to prior research on byte-level LMs, which focuses predominantly on training from scratch, we train Bolmo by byteifying existing subword-level LMs. Byteification enables overcoming the limitations of subword tokenization - such as insufficient character understanding and efficiency constraints due to the fixed subword vocabulary - while performing at the level of leading subword-level LMs. Bolmo is specifically designed for byteification: our architecture resolves a mismatch between the expressivity of prior byte-level architectures and subword-level LMs, which makes it possible to employ an effective exact distillation objective between Bolmo and the source subword model. This allows for converting a subword-level LM to a byte-level LM by investing less than 1% of a typical pretraining token budget. Bolmo substantially outperforms all prior byte-level LMs of comparable size, and outperforms the source subword-level LMs on character understanding and, in some cases, coding, while coming close to matching the original LMs' performance on other tasks. Furthermore, we show that Bolmo can achieve inference speeds competitive with subword-level LMs by training with higher token compression ratios, and can be cheaply and effectively post-trained by leveraging the existing ecosystem around the source subword-level LM. Our results finally make byte-level LMs a practical choice competitive with subword-level LMs across a wide set of use cases.
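The abstract refers to an exact distillation objective between the byte-level model and its source subword-level model. Below is a minimal illustrative sketch of what such an objective could look like, assuming the byte-level student's per-byte log-probabilities are summed over the bytes of each subword token (by the chain rule, this yields the student's log-probability of that token) and compared to the subword teacher's token-level log-probability. The function name, tensor layout, and the specific loss form are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def byteified_distillation_loss(student_byte_logits: torch.Tensor,
                                byte_ids: torch.Tensor,
                                byte_to_token: torch.Tensor,
                                teacher_token_logprobs: torch.Tensor) -> torch.Tensor:
    """Hypothetical byteification distillation loss (illustrative sketch only).

    student_byte_logits:    (num_bytes, 256) next-byte logits from the byte-level student.
    byte_ids:               (num_bytes,)     gold byte at each position (long).
    byte_to_token:          (num_bytes,)     index of the subword token each byte belongs to (long).
    teacher_token_logprobs: (num_tokens,)    teacher log-probability of each gold subword token.
    """
    # Student log-probability of each gold byte.
    byte_logprobs = F.log_softmax(student_byte_logits, dim=-1)
    gold_byte_logprobs = byte_logprobs.gather(1, byte_ids.unsqueeze(1)).squeeze(1)

    # Summing byte log-probs within a subword token gives the student's
    # log-probability of the whole token, so student and teacher can be
    # compared at exactly the same granularity.
    num_tokens = teacher_token_logprobs.shape[0]
    student_token_logprobs = torch.zeros(num_tokens).index_add_(
        0, byte_to_token, gold_byte_logprobs)

    # Penalize the gap between student and teacher log-probabilities of the
    # gold tokens (a simplification: matching full next-token distributions
    # would require marginalizing over the teacher's vocabulary).
    return (student_token_logprobs - teacher_token_logprobs).abs().mean()


# Toy usage with random tensors, just to show the expected shapes.
logits = torch.randn(6, 256)                # 6 byte positions
bytes_ = torch.randint(0, 256, (6,))        # gold bytes
mapping = torch.tensor([0, 0, 1, 1, 1, 2])  # 6 bytes span 3 subword tokens
teacher = torch.log(torch.rand(3))          # teacher token log-probs
loss = byteified_distillation_loss(logits, bytes_, mapping, teacher)
```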