

Fast Byte Latent Transformer

May 8, 2026
作者: Julie Kallini, Artidoro Pagnoni, Tomasz Limisiewicz, Gargi Ghosh, Luke Zettlemoyer, Christopher Potts, Xiaochuang Han, Srinivasan Iyer
cs.AI

Abstract

Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limited by slow, byte-by-byte autoregressive generation. We address this bottleneck in the Byte Latent Transformer (BLT) through new training and generation techniques. First, we introduce BLT Diffusion (BLT-D), a new model and our fastest BLT variant, trained with an auxiliary block-wise diffusion objective alongside the standard next-byte prediction loss. This enables an inference procedure that generates multiple bytes in parallel per decoding step, substantially reducing the number of forward passes required to generate a sequence. Second, we propose two extensions inspired by speculative decoding that trade some of this speed for higher generation quality: BLT Self-speculation (BLT-S), in which BLT's local decoder continues generating past its normal patch boundaries to draft bytes, which are then verified with a single full-model forward pass; and BLT Diffusion+Verification (BLT-DV), which augments BLT-D with an autoregressive verification step after diffusion-based generation. All methods may achieve an estimated memory-bandwidth cost over 50% lower than BLT on generation tasks. Each approach offers its own unique advantages, together removing key barriers to the practical use of byte-level LMs.
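The speculative extensions above (BLT-S and BLT-DV) both follow the generic draft-then-verify pattern: a cheap drafter proposes several bytes, and one full-model forward pass checks them, keeping the longest agreeing prefix. The sketch below is illustrative only and is not the paper's implementation; `draft_next` and `verify_greedy` are hypothetical stand-ins for the local decoder and the full BLT model, and verification is simplified to greedy acceptance.

```python
# Minimal sketch of one draft-then-verify decoding step (greedy acceptance).
# NOT the paper's code: `draft_next` and `verify_greedy` are hypothetical
# stand-ins for BLT's local decoder and full-model forward pass.

def speculative_decode_step(prefix, draft_next, verify_greedy, k=4):
    """Draft k bytes cheaply, then keep the longest verified run.

    prefix        -- list of already-accepted byte values
    draft_next    -- cheap drafter: byte sequence -> next byte (greedy)
    verify_greedy -- full model: byte sequence -> greedy byte for each
                     drafted position, computed in ONE forward pass
    """
    # 1) Draft k candidate bytes autoregressively with the cheap model.
    drafted = []
    seq = list(prefix)
    for _ in range(k):
        b = draft_next(seq)
        drafted.append(b)
        seq.append(b)

    # 2) Verify all k drafts with a single full-model pass.
    targets = verify_greedy(list(prefix) + drafted)

    # 3) Accept drafts up to the first disagreement; at the mismatch,
    #    substitute the full model's byte and stop (standard greedy
    #    speculative acceptance, so output equals full-model decoding).
    accepted = []
    for d, t in zip(drafted, targets):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)
            break
    return accepted
```

When drafts and verifier agree, k bytes are emitted for the cost of one full-model pass; on disagreement the step still makes progress by one verified byte, which is the speed-versus-quality trade the abstract describes.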
PDF · May 12, 2026